About Linux memory management

Contents

Linux memory management
Memory mapping in top: VIRT, RES and SHR
Conclusions
Memory management in the Linux kernel. A seminar at Yandex
Tasks of the memory management subsystem and the components it consists of
Hardware capabilities of the x86_64 platform
How does the kernel describe physical and virtual memory?
The memory management subsystem API
Reclaiming previously used memory (memory reclaim)
The “LRU” model
Monitoring tools
Memory cgroups

Linux memory management

Image from http://www.linuxatemyram.com/

Sooner or later in their career as a desktop or server administrator, I think every Linux user asks “Why does Linux use all my RAM while not doing much?”. Today I add another question that I am sure is common among Linux system administrators: “Why does the free command show swap in use when I have so much free RAM?”. So, from my study of SwapCached today, I present some information on memory management in a Linux system that I hope you will find useful.

Linux follows a basic rule: a page of free RAM is wasted RAM. RAM is used for a lot more than just user application data. It also stores data for the kernel itself and, most importantly, can mirror data stored on the disk for super-fast access; this is usually reported as “buffers/cache”, “disk cache” or “cached” by top. Cached memory is essentially free, in that it can be replaced quickly if a running (or newly starting) program needs the memory.

Keeping the cache means that if something needs the same data again, there’s a good chance it will still be in the cache in memory.

So, as a first step, you can use the free command to get a rough idea of how your RAM is being used.

This is the output on my old laptop with Xubuntu:

# free -m
             total       used       free     shared    buffers     cached
Mem:          1506       1373        133          0         40        359
-/+ buffers/cache:        972        534
Swap:          486         24        462

The -/+ buffers/cache line shows how much memory is used and free from the perspective of the applications. In this example, 972 MB of RAM is used and 534 MB is available for applications.
Generally speaking, if little swap is being used, memory usage isn’t impacting performance at all.
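
The arithmetic behind the -/+ buffers/cache line can be sketched in a few lines of Python. This is only an illustration that assumes the older free layout with separate buffers and cached columns; the function name app_view is made up for this example. Because free rounds every column to whole megabytes, the results can differ from the printed 972 and 534 by a couple of megabytes.

```python
def app_view(mem_line):
    """Return (used, free) in MB as seen by applications.

    Expects the 'Mem:' line of the old free layout:
    total used free shared buffers cached
    """
    fields = mem_line.split()
    total, used, free, shared, buffers, cached = (int(x) for x in fields[1:7])
    # Buffers and page cache can be dropped on demand, so count them as free.
    return used - buffers - cached, free + buffers + cached


sample = "Mem: 1506 1373 133 0 40 359"
print(app_view(sample))
```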

If you want more detail about your memory, the file to check is /proc/meminfo. This is mine on Xubuntu 12.04 with a 3.2.0-25-generic kernel:

# cat /proc/meminfo
MemTotal:        1543148 kB
MemFree:          152928 kB
Buffers:           41776 kB
Cached:           353612 kB
SwapCached:         8880 kB
Active:           629268 kB
Inactive:         665188 kB
Active(anon):     432424 kB
Inactive(anon):   474704 kB
Active(file):     196844 kB
Inactive(file):   190484 kB
Unevictable:         160 kB
Mlocked:             160 kB
HighTotal:        662920 kB
HighFree:          20476 kB
LowTotal:         880228 kB
LowFree:          132452 kB
SwapTotal:        498684 kB
SwapFree:         470020 kB
Dirty:                44 kB
Writeback:             0 kB
AnonPages:        891472 kB
Mapped:           122284 kB
Shmem:              8060 kB
Slab:              56416 kB
SReclaimable:      44068 kB
SUnreclaim:        12348 kB
KernelStack:        3208 kB
PageTables:        10380 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1270256 kB
Committed_AS:    2903848 kB
VmallocTotal:     122880 kB
VmallocUsed:        8116 kB
VmallocChunk:     113344 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB
DirectMap4k:       98296 kB
DirectMap4M:      811008 kB

MemTotal and MemFree are self-explanatory; these are some of the other values:

Cached
The Linux Page Cache (“Cached:” from meminfo ) is the largest single consumer of RAM on most systems. Any time you do a read() from a file on disk, that data is read into memory, and goes into the page cache. After this read() completes, the kernel has the option to simply throw the page away since it is not being used. However, if you do a second read of the same area in a file, the data will be read directly out of memory and no trip to the disk will be taken. This is an incredible speedup and is the reason why Linux uses its page cache so extensively: it is betting that after you access a page on disk a single time, you will soon access it again.

dentry/inode caches
Each time you do an ‘ls’ (or any other operation: open(), stat(), etc.) on a filesystem, the kernel needs data that is on the disk. The kernel parses this on-disk data and puts it into filesystem-independent structures so that it can be handled in the same way across all filesystems. In the same fashion as the page cache in the above examples, the kernel has the option of throwing away these structures once the ‘ls’ is completed. However, it makes the same bet as before: if you read it once, you’re bound to read it again. The kernel stores this information in several “caches” called the dentry and inode caches. Dentries are common across all filesystems, but each filesystem has its own cache for inodes.

This RAM is a component of “Slab:” in meminfo.

You can view the different caches and their sizes by executing this command:

head -2 /proc/slabinfo; grep -E 'dentry|inode' /proc/slabinfo

Buffer Cache
The buffer cache (“Buffers:” in meminfo) is a close relative of the dentry/inode caches. The dentries and inodes in memory represent structures on disk, but are laid out very differently. This might be because the in-memory copy contains a kernel structure, such as a pointer, that does not exist on disk. It might also be that the on-disk format has a different endianness than the CPU. The buffer cache holds the raw disk blocks from which these in-memory structures are built.
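
To see how much RAM the caches described above take, the meminfo fields can be pulled into a dictionary. The following is a minimal sketch, not a monitoring tool: the helper names parse_meminfo and cache_summary are made up for this example, the sample values are copied from the listing above, and SReclaimable is used only as a rough stand-in for the dentry and inode caches.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value (kB for most)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            values[name.strip()] = int(rest.split()[0])
    return values


def cache_summary(info):
    """Sizes in kB of the three caches discussed above."""
    return {
        "page cache": info["Cached"],
        "buffer cache": info["Buffers"],
        # Assumption: reclaimable slab approximates the dentry/inode caches.
        "dentry/inode (reclaimable slab)": info["SReclaimable"],
    }


SAMPLE = """\
MemTotal: 1543148 kB
Buffers: 41776 kB
Cached: 353612 kB
Slab: 56416 kB
SReclaimable: 44068 kB
"""

summary = cache_summary(parse_meminfo(SAMPLE))
for name, kb in summary.items():
    print(f"{name}: {kb} kB")
```

On a live system the same functions can be fed with the contents of /proc/meminfo instead of SAMPLE.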

Memory mapping in top: VIRT, RES and SHR

When you run top, there are three fields related to memory usage. To assess your server’s memory requirements you have to understand what they mean.

VIRT stands for the virtual size of a process, which is the sum of the memory it is actually using, memory it has mapped into itself (for instance the video card’s RAM for the X server), files on disk that have been mapped into it (most notably shared libraries), and memory shared with other processes. VIRT represents how much memory the program is able to access at the present moment.

RES stands for the resident size, which is an accurate representation of how much actual physical memory a process is consuming. (This also corresponds directly to the %MEM column.) This will virtually always be less than the VIRT size, since most programs depend on the C library, which is mapped in full but only partly resident.

SHR indicates how much of the VIRT size is actually sharable (memory or libraries). In the case of libraries, it does not necessarily mean that the entire library is resident. For example, if a program only uses a few functions in a library, the whole library is mapped and will be counted in VIRT and SHR, but only the parts of the library file containing the functions being used will actually be loaded in and be counted under RES.
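
The same three quantities are exposed per process in /proc/<pid>/statm, whose first three fields are the total program size, the resident set size and the resident shared pages, all counted in pages. The sketch below converts such a line to kilobytes. The function name top_columns, the sample line and the 4096-byte page size are assumptions made for illustration.

```python
def top_columns(statm_line, page_size=4096):
    """Convert a /proc/<pid>/statm line to VIRT, RES and SHR in kB."""
    size, resident, shared = (int(x) for x in statm_line.split()[:3])
    to_kb = page_size // 1024
    return {"VIRT": size * to_kb, "RES": resident * to_kb, "SHR": shared * to_kb}


# Hypothetical statm contents, in pages: size resident shared text lib data dt
sample = "52000 3100 1800 220 0 9000 0"
cols = top_columns(sample)
print(cols)
```

On a real system the page size can be obtained with os.sysconf("SC_PAGE_SIZE") rather than assumed.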

Now we have seen some information on our RAM, but what happens when there is no more free RAM? If I have no memory free, and I need a page for the page cache, inode cache, or dentry cache, where do I get it?

First of all, the kernel tries not to let you get close to 0 bytes of free RAM. This is because, to free up RAM, you usually need to allocate more: the kernel needs a kind of “working space” for its own housekeeping, so if it reaches zero free RAM it cannot do anything more.

Based on the amount of RAM and the different types (high/low memory), the kernel comes up with a heuristic for the amount of memory that it feels comfortable with as its working space. When it reaches this watermark, the kernel starts to reclaim memory from the different uses described above. The kernel can get memory back from any of these.

However, there is another user of memory that we may have forgotten about by now: user application data.
When the kernel decides not to get memory from any of the other sources we’ve described so far, it starts to swap. During this process it takes user application data and writes it to a special place (or places) on the disk. Note that this happens not only when RAM is close to full: the kernel can also decide to move to swap data in RAM that has not been used for some time (see swappiness).
For this reason, even a system with vast amounts of RAM (even when properly tuned) can swap. There are lots of pages of memory which are user application data, but are rarely used. All of these are targets for being swapped out in favor of other uses for the RAM.

You can check whether swap is in use with the free command; the last line of the output shows information about the swap space. Taking the free output from the example above:

# free -m
             total       used       free     shared    buffers     cached
Mem:          1506       1373        133          0         40        359
-/+ buffers/cache:        972        534
Swap:          486         24        462

We can see that on this computer there are 24 MB of swap used and 462 MB available.

So the mere presence of used swap is not evidence of a system which has too little RAM for its workload. The best way to determine this is to use the vmstat command: if you see a lot of pages being swapped in (si) and out (so), it means that swap is actively used and that the system is “thrashing”, that is, needing new RAM as fast as it can swap out application data.

This is the output on my Gentoo laptop while it is idle:

# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 2802448  25856 731076    0    0    99    14  365  478  7  3 88  3
 0  0      0 2820556  25868 713388    0    0     0     9  675  906  2  2 96  0
 0  0      0 2820736  25868 713388    0    0     0     0  675  925  3  1 96  0
 2  0      0 2820388  25868 713548    0    0     0     2  671  901  3  1 96  0
 0  0      0 2820668  25868 713320    0    0     0     0  681  920  2  1 96  0
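
Checking the si and so columns by eye gets tedious on long captures, so here is a small sketch that extracts them from vmstat text. The function name swap_activity is made up for this example, and the rule "any non-zero si or so means swap traffic" is a deliberately crude assumption, not a definition of thrashing.

```python
def swap_activity(vmstat_text):
    """Return a list of (si, so) pairs, one per sample row."""
    rows = []
    header = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            header = fields  # the row naming the columns
        elif header and len(fields) == len(header) and fields[0].isdigit():
            rows.append((int(fields[header.index("si")]),
                         int(fields[header.index("so")])))
    return rows


SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 2802448  25856 731076    0    0    99    14  365  478  7  3 88  3
 0  0      0 2820556  25868 713388    0    0     0     9  675  906  2  2 96  0
"""

pairs = swap_activity(SAMPLE)
busy = any(si or so for si, so in pairs)
print(pairs, "swap traffic" if busy else "no swap traffic")
```

Keep in mind that the first row vmstat prints reports averages since boot, so the later rows say more about the current state.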

Note that the output of the free command gives just two values about swap, free and used, but there is another important value for the swap space: the swap cache.

Swap Cache

The swap cache is very similar in concept to the page cache. A page of user application data written to disk is very similar to a page of file data on the disk. Any time a page is read in from swap (“si” in vmstat), it is placed in the swap cache. Just like the page cache, this is a bet on the kernel’s part. It is betting that we might need to swap this page out _again_. If that need arises, we can detect that there is already a copy on the disk and simply throw the page in memory away immediately. This saves us the cost of re-writing the page to the disk.

The swap cache is really only useful when we are reading data from swap and never writing to it. If we write to the page, the copy on the disk is no longer in sync with the copy in memory. If this happens, we have to write to the disk to swap the page out again, just like we did the first time. However, the cost of saving _any_ writes to disk is great, and even with only a small portion of the swap cache ever written to, the system will perform better.

So, to know how much swap is really used, we should subtract the value of SwapCached from the used swap (SwapTotal minus SwapFree); you can find this information in /proc/meminfo.
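
As a worked example of that subtraction, the sketch below uses the three swap fields from the /proc/meminfo listing shown earlier. The function name swap_really_used is made up for this illustration, and it follows this article's reasoning rather than any official definition.

```python
def swap_really_used(info):
    """Swap in use minus the part that is also cached in RAM (kB)."""
    used = info["SwapTotal"] - info["SwapFree"]
    return used - info["SwapCached"]


# Values taken from the /proc/meminfo listing earlier in the article.
info = {"SwapTotal": 498684, "SwapFree": 470020, "SwapCached": 8880}
print(swap_really_used(info))  # 28664 kB used, minus 8880 kB cached -> 19784
```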

Swappiness

When an application needs memory and all the RAM is fully occupied, the kernel has two ways to free some memory at its disposal: it can either reduce the disk cache in the RAM by eliminating the oldest data or it may swap some less used portions (pages) of programs out to the swap partition on disk. It is not easy to predict which method would be more efficient. The kernel makes a choice by roughly guessing the effectiveness of the two methods at a given instant, based on the recent history of activity.

Before the 2.6 kernels, the user had no means to influence the calculations, and situations could arise where the kernel often made the wrong choice, leading to thrashing and slow performance. The addition of swappiness in 2.6 changed this.

Swappiness takes a value between 0 and 100 to change the balance between swapping applications and freeing cache. At 100, the kernel will always prefer to find inactive pages and swap them out; in other cases, whether a swapout occurs depends on how much application memory is in use and how poorly the cache is doing at finding and releasing inactive items.

The default swappiness is 60. A value of 0 gives something close to the old behavior where applications that wanted memory could shrink the cache to a tiny fraction of RAM. For laptops which would prefer to let their disk spin down, a value of 20 or less is recommended.
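
The current value lives in /proc/sys/vm/swappiness. The sketch below reads it and prints a rough interpretation based on the description above. The function names and the wording of the three categories are assumptions made for illustration; the kernel itself has no such labels.

```python
def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness as an int, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def describe(value):
    """Rough reading of a swappiness value, following the text above."""
    if value is None:
        return "unknown"
    if value == 0:
        return "shrink the cache first, swap as little as possible"
    if value >= 100:
        return "always prefer swapping out inactive pages"
    return "balance between swapping and dropping cache"


print(describe(read_swappiness()))
```

Changing the value is done as root, for example with sysctl vm.swappiness=20.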

Conclusions

In this article I’ve collected some information that I’ve found useful in my work as a system administrator; I hope it can be useful to you as well.



Memory management in the Linux kernel. A seminar at Yandex

Hi! My name is Roman Gushchin. At Yandex I work on the Linux kernel. Some time ago I gave a seminar for system administrators covering a general description of the Linux memory management subsystem, as well as some problems we have run into and ways of solving them. Most of the material describes the “vanilla” Linux kernel (3.10), but some of it is specific to the kernel used at Yandex. The seminar may well be of interest not only to system administrators but to anyone who wants to learn how Linux handles memory.

The main topics covered at the seminar:

Tasks and components of the memory management subsystem;
Hardware capabilities of the x86_64 platform;
How the kernel describes physical and virtual memory;
The memory management subsystem API;
Reclaiming previously used memory;
Monitoring tools;
Memory cgroups;
Compaction: defragmentation of physical memory.

Below is a more detailed outline of the talk, with the main concepts and principles explained.

Tasks of the memory management subsystem and the components it consists of

The subsystem’s main task is to allocate physical memory to the kernel and to userspace processes, and to reclaim and redistribute it when all memory is in use.

Main components:

The buddy allocator manages the pool of free memory.
Page replacement (the “LRU” reclaim model) decides whom to take memory from when free memory runs out.
PTE management is the block that manages the translation tables.
The SLUB kernel allocator is the kernel’s internal allocator.
And others.

Hardware capabilities of the x86_64 platform

The NUMA scheme means that each physical processor has some amount of memory attached to it, which it can access fastest. Accessing memory attached to other processors is considerably slower.

How does the kernel describe physical and virtual memory?

Physical memory is described in the kernel by three structures: nodes (pg_data_t), zones (struct zone) and pages (struct page). Each process has its own virtual memory, described by struct mm_struct. It is in turn divided into regions (struct vm_area_struct).

The memory management subsystem API

The kernel interacts with the memory management subsystem through functions such as __get_free_page(), kmalloc(), kfree() and vmalloc(). They are responsible for allocating free pages and large and small memory regions, and for releasing them. There is a whole family of similar functions that differ in small details, for example in whether the region is zeroed.

User programs interact with the mm subsystem through mmap(), munmap(), brk(), mlock() and munlock(). There are also posix_fadvise() and madvise(), which can give the kernel “advice”. Strictly speaking, though, the kernel is not obliged to take it into account in its heuristics.
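
From user space, the mmap()/madvise()/munmap() sequence can be tried out through Python’s standard mmap module, which wraps these calls. This is a minimal sketch: the function name scratch_roundtrip is made up for this example, and the choice of MADV_SEQUENTIAL is arbitrary, used only because it is a harmless hint.

```python
import mmap


def scratch_roundtrip(payload, length=mmap.PAGESIZE):
    """Map anonymous memory, write payload, read it back, then unmap."""
    region = mmap.mmap(-1, length)  # anonymous mapping, no file behind it
    try:
        # madvise() is only exposed on some platforms and Python versions.
        if hasattr(region, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            try:
                region.madvise(mmap.MADV_SEQUENTIAL)
            except OSError:
                pass  # advice is only a hint, so a refusal can be ignored
        region.write(payload)
        region.seek(0)
        return region.read(len(payload))
    finally:
        region.close()  # releases the mapping


print(scratch_roundtrip(b"hello"))
```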

Reclaiming previously used memory (memory reclaim)

The system always tries to keep some amount of memory free (the free pool). This makes allocation much faster, because memory does not have to be reclaimed at the very moment it is needed.

The pages that are in constant use (system libraries and the like) are called the working set. Evicting them from memory slows the whole system down. The overall rate of memory consumption in the system is called memory pressure. It can fluctuate widely depending on how loaded the system is.

All memory not occupied by the kernel can be divided into two parts: anonymous memory and file memory. They differ in that for file memory we know for certain that each piece corresponds to some file and can be written back to it; anonymous memory has no such file behind it.

The “LRU” model

LRU stands for least recently used. It is an abstraction that suggests discarding the pages that have gone the longest without being accessed. It cannot be implemented fully in Linux, because all we know is whether a given page has ever been accessed. To track how often pages are accessed at least roughly, the active, inactive and unevictable lists are used. The last one holds pages locked by the user, which will not be evicted from memory under any circumstances.

There are strict rules for moving pages between the inactive and active lists. Under memory pressure, pages from the inactive list can either be evicted from memory or moved to the active list. Pages from the active list are moved to the inactive list if they have not been accessed for a long time.
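
The movement between the two lists can be illustrated with a toy model. This is a deliberately simplified sketch and not the kernel’s algorithm: the class name TwoListLRU, the capacity and max_active parameters, and the rule “promote on the second access” are all assumptions chosen to keep the example short. It expects max_active to be smaller than capacity.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy model of the active/inactive lists (not the kernel's algorithm)."""

    def __init__(self, capacity, max_active):
        self.capacity = capacity        # total pages that fit in "RAM"
        self.max_active = max_active    # cap on the active list
        self.active = OrderedDict()     # oldest entry first
        self.inactive = OrderedDict()   # oldest entry first
        self.evicted = []

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)        # still hot
        elif page in self.inactive:
            del self.inactive[page]              # second touch: promote
            self.active[page] = True
            if len(self.active) > self.max_active:
                old, _ = self.active.popitem(last=False)
                self.inactive[old] = True        # demote the coldest active page
        else:
            self.inactive[page] = True           # first touch: start as inactive
            while len(self.active) + len(self.inactive) > self.capacity:
                victim, _ = self.inactive.popitem(last=False)
                self.evicted.append(victim)      # reclaim from the inactive list


lru = TwoListLRU(capacity=4, max_active=2)
for page in ["a", "b", "a", "c", "d", "e"]:
    lru.access(page)
print(list(lru.active), list(lru.inactive), lru.evicted)
```

In this run page a was touched twice and therefore sits on the active list, so when memory runs out it is page b, touched only once, that gets evicted.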

Monitoring tools

The top utility shows memory consumption statistics for the system. The vmtouch program shows how much of a given file is in memory. Exhaustive information on the number of file, active and inactive pages can be found in /proc/vmstat. Buddy allocator statistics are in /proc/buddyinfo, and SLUB allocator statistics are in /proc/slabinfo. It is often useful to look at perf top, where all fragmentation problems are clearly visible.
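
Each row of /proc/buddyinfo lists, for one zone, the number of free blocks of 1, 2, 4, ... contiguous pages, so the total number of free pages per zone is a weighted sum. The sketch below computes it. The function name free_pages_per_zone and the sample rows are hypothetical; only the column layout is taken from the real file.

```python
def free_pages_per_zone(buddyinfo_text):
    """Return {(node, zone): free pages} from /proc/buddyinfo-style text."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node, zone = int(fields[1]), fields[3]
        counts = [int(x) for x in fields[4:]]
        # Column k counts free blocks of 2**k contiguous pages.
        totals[(node, zone)] = sum(n << order for order, n in enumerate(counts))
    return totals


# Hypothetical sample in the /proc/buddyinfo layout.
SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal    100     50     25      0      0      0      0      0      0      0      0
"""
print(free_pages_per_zone(SAMPLE))
```

A zone whose free pages sit almost entirely in the low-order columns, like the Normal row here, is fragmented: plenty of memory is free, but no large contiguous block is available.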

Memory cgroups

Cgroups grew out of the wish to single out a group of several processes, combine them logically and limit their total memory consumption to a set amount. If they reach their limit, memory has to be freed from the amount allotted to them, that is, memory belonging to that particular cgroup (this is called target reclaim). If the system as a whole has simply run out of memory and the free pool needs refilling, that is called global reclaim. From the accounting point of view, each page belongs to exactly one cgroup: the one that read it first.
