The scheduler in linux

Содержание

Принцип работы планировщика задач в Linux
Типы процессов в Linux
Планирование в реальном времени
SCHED_FIFO
SCHED_RR
Обобщение по планированию в реальном времени
Условное планирование
Виртуальное время выполнения
CFS —Абсолютно справедливый планировщик
Почему?
Selecting a Linux I/O Scheduler
5 Answers 5
Search
The Linux Scheduler

Принцип работы планировщика задач в Linux

Планирование – это процесс распределения ресурсов системы для выполнения задач. В статье мы рассмотрим его вариант, в котором ресурсом является одно или несколько ядер процессора, а задачи представлены потоками или процессами, которые нужно выполнить.

Само планирование осуществляется планировщиком, который нацелен:

Максимизировать пропускную способность, то есть количество задач, выполняемых за единицу времени.
Минимизировать время ожидания, то есть время, прошедшее с момента готовности процесса до начала его выполнения.
Минимизировать время ответа, то есть время, прошедшее с момента готовности процесса до завершения его выполнения.
Максимизировать равнодоступность, то есть справедливое распределение ресурсов между задачами.

Если с этими метриками вы не знакомы, то предлагаю просмотреть несколько примеров в другой моей статье (англ.), посвященной алгоритмам планировщика.

Типы процессов в Linux

В Linux процессы делятся на два типа:

Процессы реального времени.
Условные процессы.

Процессы реального времени должны вписываться в границы времени ответа, независимо от загрузки системы. Иначе говоря, такие процессы являются срочными и ни при каких условиях не откладываются.

В качестве примера можно привести процесс переноса, отвечающий за распределение рабочей нагрузки между ядрами ЦПУ.

Условные же процессы не ограничиваются строгими рамками времени ответа и в случае занятости системы могут подвергаться задержкам.

В пример можно привести процесс браузера, который позволяет вам читать эту статью.

У каждого типа процессов есть свой алгоритм планирования. При этом пока есть готовые к выполнению процессы реального времени, выполняться будут они, оставляя условные процессы в ожидании.

Планирование в реальном времени

В случае с планированием в реальном времени используются две политики, SCHED_RR и SCHED_FIFO .

Политика определяет количество выделяемого процессу времени, а также принцип организации очереди на выполнение.

Суть в том, что готовые к выполнению процессы хранятся в очереди, откуда выбираются планировщиком на основе той или иной политики.

SCHED_FIFO

В данной политике планировщик выбирает процесс, ориентируясь на время его поступления (FIFO = первым вошел, первым вышел).

Процесс с политикой планирования SCHED_FIFO может «освободить» ЦПУ в нескольких случаях:

Процесс ожидает, к примеру, операции ввода/вывода, после чего по возвращению в состояние «готов» помещается в конец очереди.
Процесс уступил ЦПУ через системный вызов sched_yield , после чего он тут же возвращается в конец очереди.

SCHED_RR

SCHED_RR подразумевает циклическое планирование.

В этой политике каждый процесс в очереди получает интервал времени (квант) и выполняется в свою очередь (исходя из приоритета) по циклическому принципу.

Для лучшего понимания рассмотрим пример, где в очереди находятся три процесса, A B C , все из которых работают по политике SCHED_RR .

Как показано ниже, каждый процесс получает квант времени и выполняется в свою очередь. После однократного выполнения всех процессов они повторяются в той же последовательности.

Обобщение по планированию в реальном времени

Процесс реального времени может планироваться по двум разным политикам, SCHED_FIFO и SCHED_RR .

Политика влияет на принцип работы очереди и определяет, сколько времени нужно выделить тому или иному процессу.

Условное планирование

Здесь мы знакомимся с Completely Fair Scheduler (CFS, абсолютно справедливый планировщик), представляющим алгоритм планирования условных процессов, появившийся в версии Linux 2.6.23.

Помните метрики планирования, которые мы затронули в начале статьи? Так вот CFS фокусируется на одной из них – он стремится к максимальной равноправности процессов, то есть обеспечивает выделение всем процессам равных квантов времени ЦПУ.

Обратите внимание, что процессы с повышенным приоритетом все равно могут получать на обработку больше времени.

Для лучшего понимания принципа работы CFS нужно познакомиться с новым термином – виртуальное время выполнения ( vruntime ).

Виртуальное время выполнения

Виртуальное время выполнения процесса – это количество времени, потраченного именно на выполнение, без учета любых ожиданий.

Как было сказано, CFS стремится быть максимально справедливым, в связи с чем по очереди планирует готовый к выполнению процесс с минимальным виртуальным временем.

CFS задействует переменные, содержащие максимальное и минимальное виртуальное время выполнения, и чуть позже станет ясно зачем.

CFS —Абсолютно справедливый планировщик

Прежде чем перейти к принципу работы этого алгоритма, нужно понять, какие структуры данных он использует.

CFS задействует красно-черное дерево, представляющее бинарное дерево поиска – то есть добавление, удаление и поиск выполняются за O(logN) , где N выражает количество процессов.

Ключом в этом дереве выступает виртуальное время выполнения процесса. Новые процессы или процесс, возвращающиеся из ожидания в состояние готовности, добавляются в дерево с ключом vruntime = min_vruntime . Это очень важный момент, который позволяет избежать дефицита внимания ЦПУ для старых процессов.

Вернемся к самому алгоритму. В первую очередь он устанавливает себе лимит времени – sched_latency .

В течение этого времени алгоритм стремится выполнить все готовые процессы – N . Это означает, что каждый процесс получит интервал времени равный временному лимиту, поделенному на количество процессов: Qi = sched_latency/N .

Когда процесс исчерпывает свой интервал ( Qi ), алгоритм выбирает в дереве следующий процесс с наименьшим виртуальным временем.

Рассмотрим ситуацию, которая может стать проблематичной для такой схемы работы алгоритма.

Предположим, что алгоритм выбрал лимит времени 48мс при наличии 6 процессов – в этом случае каждый процесс получит на выполнение по 8мс.

Но что произойдет, если система окажется перегружена процессами? Предположим, что лимит времени остается равен 48мс, но теперь у нас 32 процесса. В результате каждый получит уже всего по 1.5мс на выполнение, что приведет к замедлению работы всей системы.

Читайте также: Progtv для windows x64

Почему?

Все дело в переключении контекста, которое подразумевает сохранение состояния процесса или потока с последующим его восстановлением и продолжением выполнения.

Каждый раз, когда процесс исчерпывает свое время на выполнение, и планируется очередной процесс, активируется переключение контекста, которое также занимает некоторое время.

Предположим, что на него уходит 1мс. В первом примере, где каждому процессу у нас отводилось по 8мс, это вполне допустимо. Так мы тратим 1мс на переключение контекста и 7мс на фактическое выполнение процесса.

А вот во втором примере на выполнение каждого процесса останется уже всего по 0.5мс – то есть большая часть времени уходит на переключение контекста, отсюда и проблема с выполнением.

Для того, чтобы исправить ситуацию, мы вводим новую переменную, которая определит минимальную протяженность кванта времени выполнения – min_granularity .

Представим, что min_granularity = 6мс , и вернемся к нашему примеру, где лимит времени равен 48мс при наличии 32 процессов.

С помощью той же формулы, что и прежде, мы получаем по 1.5мс на каждый процесс, но теперь такой вариант не допускается, так как min_granularity задает минимальный квант времени, который должен получить каждый процесс.

В данном случае, где Qi , мы берем Qi равным min_granularity , то есть 6мс, и соответствующим образом изменяем временной лимит. В результате он составит Qi x N = 6мс x 32 = 192мс .

На данный момент отличия между CFS и RR могут оказаться недостаточно наглядны, поскольку они оба определяют временные интервалы и организуют порядок выполнения процессов. Для лучшего обобщения и понимания различий между этими алгоритмами я приведу краткую таблицу:

RR – циклический список	CFS – абсолютно справедливый планировщик
Квант времени статичен и не зависит от количества процессов в системе.	Квант времени динамичен и может изменяться в соответствии с количеством процессов в системе.
По исчерпанию процессом его кванта времени, RR выбирает очередной процесс с наименьшим виртуальным временем из циклического списка.	По исчерпанию процессом его кванта времени, CFS выбирает очередной процесс с наименьшим виртуальным временем из красно-черного дерева.

Надеюсь, что статься помогла вам лучше понять реализацию планирования задач в ядре Linux.

Прошу обратить внимание, что автор оригинальной статьи Eliran поблагодарил читателей за интерес и отдельно пригласил желающих со знанием английского языка в свой блог Coding Kaiser для ознакомления с множеством материалов по смежным и другим интересным темам, а также обмена идеями.

Источник

Selecting a Linux I/O Scheduler

I read that it’s supposedly possible to change the I/O scheduler for a particular device on a running kernel by writing to /sys/block/[disk]/queue/scheduler. For example I can see on my system:

that the default is the completely fair queuing scheduler. What I’m wondering is if there is any use in including all four schedulers in my custom kernel. It would seem that there’s not much point in having more than one scheduler compiled in unless the kernel is smart enough to select the correct scheduler for the correct hardware, specifically the ‘noop’ scheduler for flash based drives and one of the others for a traditional hard drive.

Is this the case?

5 Answers 5

As documented in /usr/src/linux/Documentation/block/switching-sched.txt , the I/O scheduler on any particular block device can be changed at runtime. There may be some latency as the previous scheduler’s requests are all flushed before bringing the new scheduler into use, but it can be changed without problems even while the device is under heavy use.

Ideally, there would be a single scheduler to satisfy all needs. It doesn’t seem to exist yet. The kernel often doesn’t have enough knowledge to choose the best scheduler for your workload:

noop is often the best choice for memory-backed block devices (e.g. ramdisks) and other non-rotational media (flash) where trying to reschedule I/O is a waste of resources
deadline is a lightweight scheduler which tries to put a hard limit on latency
cfq tries to maintain system-wide fairness of I/O bandwidth

The default was anticipatory for a long time, and it received a lot of tuning, but was removed in 2.6.33 (early 2010). cfq became the default some while ago, as its performance is reasonable and fairness is a good goal for multi-user systems (and even single-user desktops). For some scenarios — databases are often used as examples, as they tend to already have their own peculiar scheduling and access patterns, and are often the most important service (so who cares about fairness?) — anticipatory has a long history of being tunable for best performance on these workloads, and deadline very quickly passes all requests through to the underlying device.

Источник

Search

The Linux Scheduler

Last month, we started a new series on Linux kernel internals. In that first part, we looked at how Linux manages processes and why in many ways Linux is better at creating and maintaining processes than many commercial UNIX systems.

This time, we dwell a bit on the subject of scheduling. Much to my surprise, here again, Linux goes the unorthodox way and disregards conventional wisdom in kernel theory. The results are excellent. Let’s see how.

In Linux 2.2. x there are three classes of processes, as can be seen from the data definition for the scheduler (from linux/include/linux/sched.h):

SCHED_OTHER tasks are the normal user tasks (default).

Tasks running in SCHED_FIFO will never be preempted. They will leave the CPU only for waiting sync kernel events or if an explicit sleep or reschedule has been requested from user space.

Tasks running in SCHED_RR are real time (RT), but they will leave the CPU if there is another real-time task in the run queue. So the CPU power will be distributed between all SCHED_RR tasks. If at least one RT task is running, no other SCHED_OTHER task will be allowed to run in any CPU. Each RT task has an rt_priority so the SCHED_RR class will be allowed to distribute the CPU power between all the SCHED_RR tasks at will. The rt_priority of the SCHED_RR class works exactly as the normal priority field for of the SCHED_OTHER (default) class.

Only the root user can change the class of the current task via the sched_setscheduler syscall.

One of the tasks of a kernel is to make sure the system remains firmly under its control even in situations of misbehaving programs. One such misbehaving program might fork too many processes too quickly. Thus, the kernel becomes so busy with itself that it cannot cater to its other responsibilities. I found out that Linux has no limit to how fast user-land programs can spawn children. HP-UX, Solaris and AIX have a limit of one fork per processor tick (called a jiffie under Linux). The patch in Listing 1 (see Resources) will allow a maximum of one fork per jiffie (one jiffie is usually 1/100 second, except on the Alpha architecture where it is 1/1024).

Threads are necessary to allow your process to make use of multiple processors. Linux doesn’t really make any distinction between a process and a thread from a memory management and scheduling point of view. Some operating systems, like Solaris, manage threads within the user process by means of a thread scheduling library. The kernel sees only the process and doesn’t know which thread, if any, is actually executing inside the user process. This saves the kernel from managing lists with thousands of entries for each thread for each process.

Obviously, the threads emulated on the top of one single user process won’t be allowed to run concurrently on SMP, so the user-space approach won’t scale very well on an SMP machine. Threading is strictly necessary only when all threads will be CPU-bound and not mainly I/O-oriented. If all the threads are CPU-bound, you definitely want to be able to scale for SMP.

Using threads only to wait for events is overkill. On the other hand, having threads sleeping is a waste of resources and performance. Almost all subsystems in Linux (such as TCP/IP) offer async event registration. Using async event via the SIGIO signal is similar to IRQ-driven handling.

With the user-space approach, you will at least avoid the TLB (translation lookaside buffer) flushing, as all the threads will share the same address space.

The advantage of having threads managed in user space through the threads library is the kernel will spend the scheduling CPU cost in user space. It is true that in user space, you may choose to implement a very fast round-robin scheduler that may cut down the scheduling cost, compared to the clever (but more expensive, in terms of execution path) Linux scheduler.

Speaking of SMP, as of Linux 2.4 I found there is no way to declare the processor affinity of any given user-space process.

The scheduler could keep track of the CPU affinity declaration of a process, or it could just determine a preferred CPU for a process by itself. The other day, together with Andrea Arcangeli of Italy, I designed the simple kernel patch in Listing 2 (see Resources) that implements processor affinity. Notice that processor affinity makes the most sense when other processes are excluded from running on this CPU. A better way to implement this patch would be to have the system administrator set affinity with an external call like nice .

The 2.2. x SMP kernel scheduler has some bugs that sometimes make it less effective than the UP (UniProcessor) scheduler. Nevertheless, Andrea fixed all such bugs and rewrote the heuristics from scratch and the SMP scheduler gives an impressive SMP improvement under load. The SMP changes are just in the 2.3. x kernels, and I plan to integrate it in 2.2. x also. You can get Andrea’s patch at ftp.suse.com/pub/people/andrea/kernel-patches/my-2.2.12/SMP-scheduler-2_2_11-E. That patch can be used against 2.2.11 and 2.2.12 and speeds up both kernels on SMP systems. The patch was merged into 2.3.15, but not yet in 2.2. x because it’s a performance issue only and not a true bug fix.

The SMP scheduler heuristic mechanism works as a function of (not in any particular order):

the last CPU where the wakenup task was running

the memory management of the task (for optimising kernel-threads reschedule)

the “goodness” of the tasks running on the busy CPUs

the time necessary to invalidate the L2 cache on the running CPU (cacheflush_time)

the average time slice (avg_slice) of the wakenup task (how much time the task runs before returning to sleep)

The algorithm collects the above data and chooses the best CPU on which to reschedule the wakenup task.

There are two paths involved in the Linux scheduler behavior:

schedule : the running/current task is a SCHED_OTHER task that expired its time slice (so the kernel runs a schedule while returning from the timer IRQ for switching to the next running task).

reschedule_idle : a task got a wakeup (usually from an IRQ), and so we try to reschedule such wakenup task in the best CPU by invoking a schedule on it (it’s a kind of controlled schedule).

Both paths share the goodness function. The goodness function can be considered the core of the SMP scheduler. It calculates the “goodness” of a task as a function of the following:

the task currently running

the task that wants to run

the current CPU

A plain schedule only works based on goodness. As you can see in Listing 3, a plain schedule is SMP-aware. The goodness of the potential next task increases if its last CPU is the current CPU.

Nevertheless, reschedule_idle is far more critical for CPU affinity and scheduler latencies; for example, if you comment out reschedule_idle, the scheduler latency will become infinite. Also, reschedule_idle takes care of the cache-flush time and the task average time slice, and this is the truly interesting part of the SMP scheduler. In UP, reschedule_idle is not as interesting as the SMP version. Listing 4 (see Resources) is the reschedule_idle implementation taken from 2.3.26 (soon 2.4).

The final objective of reschedule_idle is to call a schedule on a CPU in order to reschedule the wakenup task in it. We use goodness in reschedule_idle because we want to predict the effect of the future schedule that we’ll send to that CPU. By predicting the effect of the future schedule, we can choose the best CPU to reschedule at wakeup time. This, of course, saves us the trouble of executing on a CPU without the proper TLB settings. If the CPU to reschedule is not the current one, we send a reschedule event via inter-CPU message passing (SMP-IPI interrupt on i386).

To make it very clear: the goodness function is the core of the Linux scheduler and is SMP aware, while reschedule_idle is the core of the clever SMP heuristics.

Linux can only do user preemption. Linus Torvalds, it seems, doesn’t believe in kernel preemption. That’s not as bad as it may seem; all is fine for semaphores. Critical sections protected by semaphores can be preempted at any time, as every contention will end in a schedule and there can’t be any deadlock. However, critical sections protected by fast spin locks or by hand locks cannot be preempted unless we block the timer IRQ. So all spin locks should be IRQ-safe. Also, by avoiding kernel preemption, the kernel becomes more robust and simpler, since a lot of complicated code can be saved this way.

By the way, there are tools to monitor the scheduler latency in order to allow the interested hacker to catch potential code sections that need conditional schedules.

Linux, due to its GPL nature, allows us to do things faster than others, because you can adapt and recompile your own kernel instead of using a standard fit-all kernel. For example, in some popular proprietary operating systems such as Solaris 7, many code sections have been packaged within spin_lock and spin_unlock to make the same code work well in both UP and SMP systems. While vendors of these commercial operating systems tout this as a clear advantage for heavy SMP systems, these locks actually bog down UP systems and simple SMP machines, because the same binary driver must work on both SMP and UP kernels. A spin_unlock is one locked ASM instruction:

and calling such clear-bit instruction through a function pointer is complete overkill .

Linux, on the other hand, has in-line code sections for UP and SMP systems, adapting to the machine it is running on. Therefore, if only a UniProcessor system is hosting the kernel, no time is lost locking a code section that doesn’t need it.

Last but not least, as we saw above, the goodness function makes the SMP scheduler in Linux very clever. Having a clever SMP scheduler is critical for performance. If the scheduler is not SMP-aware, OS theory teaches us that the performance on an SMP machine can be even worse than on a UP machine. (This was happening in 2.2.x without the new heuristics, but has now been fixed.)

That’s it for this month. I hope you gained some deeper understanding of how Linux schedules tasks on UP and SMP systems.

Источник