What are Linux interrupts

Contents

High traffic volumes and interrupt management in Linux
External Interrupts in the x86 system. Part 2. Linux kernel boot options
Boot without any extra options
pci=nomsi
noapic
nolapic
Combinations of options:
Interrupt routing tables and the options «acpi=noirq», «pci=noacpi», «acpi=off»
Conclusion
Acknowledgments

High traffic volumes and interrupt management in Linux

In this note I describe ways to improve the performance of a Linux router. The topic became relevant for me when the traffic passing through a single Linux router grew fairly high (>150 Mbit/s, >50 Kpps). Besides routing, the router also shapes traffic and acts as a firewall.

For high loads it is worth using Intel network cards based on the 82575/82576 (Gigabit) or 82598/82599 (10 Gigabit) chipsets, or similar ones. Their appeal is that they create eight interrupt queues per interface: four for rx and four for tx (the RPS/RFS technologies that appeared in kernel 2.6.35 may do the same for ordinary network cards). These chips also accelerate traffic processing in hardware quite well.
To begin, look at the contents of /proc/interrupts. This file shows what is raising interrupts and which cores are handling them.

This example uses Intel 82576 network cards. Here the network interrupts are spread evenly across the cores. That is not the default, however: the interrupts have to be distributed across the processors by hand. To do this, run echo N > /proc/irq/X/smp_affinity, where N is the processor mask (it determines which processor receives the interrupt) and X is the interrupt number shown in the first column of /proc/interrupts. To compute the processor mask, raise 2 to the power cpu_N (the processor number) and convert the result to hexadecimal. With bc it is computed like this: echo "obase=16; $[2 ** $cpu_N]" | bc. In this example the interrupts were distributed as follows:

Also, if the router has two interfaces, one for input and one for output (the classic scheme), then rx of one interface should be grouped with tx of the other interface on the same processor core. For example, in this case interrupts 46 (eth0-rx-0) and 59 (eth1-tx-0) were assigned to the same core.
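The mask rule and the rx/tx pairing described above can be sketched in a few lines of Python. This is only an illustration: the helper names cpu_mask and pair_rx_tx are made up for this sketch, IRQs 47 and 60 are placeholders (only 46 and 59 come from the text), and the plain hexadecimal mask assumes a machine with fewer than 32 CPUs.

```python
def cpu_mask(cpu: int) -> str:
    """Hex mask with only bit `cpu` set, i.e. 2 ** cpu in base 16."""
    if cpu < 0:
        raise ValueError("CPU number must be non-negative")
    return format(1 << cpu, "x")


def pair_rx_tx(rx_irqs: dict, tx_irqs: dict) -> list:
    """Pin rx queue i of one interface and tx queue i of the other
    interface to the same core. Returns a list of (irq, mask) pairs."""
    plan = []
    for core, queue in enumerate(sorted(rx_irqs)):
        mask = cpu_mask(core)
        plan.append((rx_irqs[queue], mask))
        plan.append((tx_irqs[queue], mask))
    return plan


# Queue number -> IRQ number. 46 and 59 are from the text,
# 47 and 60 are hypothetical placeholders.
eth0_rx = {0: 46, 1: 47}
eth1_tx = {0: 59, 1: 60}

# Print the commands instead of writing to /proc, so the sketch is safe to run.
for irq, mask in pair_rx_tx(eth0_rx, eth1_tx):
    print(f"echo {mask} > /proc/irq/{irq}/smp_affinity")
```

The sketch only prints the commands. Applying them requires root privileges and the real IRQ numbers from /proc/interrupts.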
Another very important parameter is the delay between interrupts. The current value can be viewed with ethtool -c ethN, in the rx-usecs and tx-usecs parameters. The larger the value, the higher the latency but the lower the CPU load. Try reducing this value during peak hours, down to zero if necessary.
When preparing a server with Intel Xeon E5520 (8 cores, each with HyperThreading) for production, I chose the following interrupt distribution scheme:

/proc/interrupts on this server without load can be viewed here. I do not include it in the note because of its size.

Update:
If the server works only as a router, tuning the TCP stack does not matter much. There are, however, sysctl parameters that increase the size of the ARP cache, which may be relevant. When the ARP cache is too small, dmesg will contain the message "Neighbour table overflow".
For example:
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

Parameter description:
gc_thresh1: the minimum number of entries that should be in the ARP cache. If the number of entries is below this value, the garbage collector will not clean the ARP cache.
gc_thresh2: the soft limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector will run within 5 seconds.
gc_thresh3: the hard limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector will run immediately.
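To make the three thresholds concrete, here is a small Python sketch that classifies an ARP cache size according to the descriptions above. It is a simplified model, not kernel code. The function name gc_behaviour and the returned labels are invented for this illustration, and the behaviour between gc_thresh1 and gc_thresh2 (labelled "below soft limit") is an assumption, since the text does not describe it.

```python
def gc_behaviour(entries: int,
                 thresh1: int = 1024,
                 thresh2: int = 2048,
                 thresh3: int = 4096) -> str:
    """Classify an ARP cache size using the threshold descriptions above.

    Defaults are the example sysctl values from the text.
    """
    if not thresh1 <= thresh2 <= thresh3:
        raise ValueError("expected thresh1 <= thresh2 <= thresh3")
    if entries < thresh1:
        return "no cleanup"          # garbage collector leaves the cache alone
    if entries < thresh2:
        return "below soft limit"    # assumption: not described in the text
    if entries < thresh3:
        return "soft limit reached"  # collector runs within 5 seconds
    return "hard limit reached"      # collector runs immediately


for size in (500, 1500, 3000, 5000):
    print(size, "->", gc_behaviour(size))
```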

Source

External Interrupts in the x86 system. Part 2. Linux kernel boot options

In the last part we discussed the evolution of interrupt delivery from devices in the x86 system (PIC → APIC → MSI), the general theory, and all the necessary terminology.

In this practical part we will look at how to roll back to obsolete methods of interrupt delivery in Linux, in particular through the Linux kernel boot options pci=nomsi, noapic, and nolapic.

We will also look at the order in which the OS looks for interrupt routing tables (ACPI/MPtable/$PIR) and at the impact of the boot options acpi=noirq, pci=noacpi, and acpi=off.

You’ve probably used some combination of these options when one of the devices in your system hasn’t worked correctly because of an interrupt problem. We’ll go through these options and find out what they do and how they change the kernel ‘/proc/interrupts’ interface output.

Boot without any extra options

For our interrupt investigation in this article we will use a custom board with an Intel Haswell i7 CPU and the LynxPoint-LP chipset, running coreboot.

We will get information about interrupts in the Linux system by reading the ‘/proc/interrupts’ file.

Here is the output when the kernel was booted without any extra options:

The file ‘/proc/interrupts’ is the Linux procfs interface to the interrupt subsystem. It presents a table of interrupt counts for every CPU core in the system, in the following form:

First column: interrupt number
CPUx columns: interrupt counters for every CPU core in the system
Next column: interrupt type:
- IO-APIC-edge — edge-triggered interrupt for the I/O APIC controller
- IO-APIC-fasteoi — level-triggered interrupt for the I/O APIC controller
- PCI-MSI-edge — MSI interrupt
- XT-PIC-XT-PIC — interrupt for the PIC controller (we will see it later)
Last column: device (driver) associated with this interrupt
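The column layout described above can be turned into a small parser. The following Python sketch assumes the format exactly as listed (one token for the interrupt type); newer kernels may print the type differently. The SAMPLE text is invented for illustration and is not the output of the board discussed here, and parse_interrupts is a hypothetical helper name.

```python
# Made-up sample in the layout described above (not real output of the board).
SAMPLE = """\
           CPU0       CPU1
  0:         20          0   IO-APIC-edge      timer
 19:          0       1500   IO-APIC-fasteoi   ehci_hcd:usb1
 40:       3000       2900   PCI-MSI-edge      ahci
"""


def parse_interrupts(text: str) -> list:
    """Split each row into IRQ name, per-CPU counters, type and device."""
    lines = text.splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        if ":" not in line:
            continue
        irq, rest = line.split(":", 1)   # only the first colon ends the IRQ name
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters.
        while fields and len(counts) < len(cpus) and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        kind = fields[0] if fields else ""
        device = " ".join(fields[1:])
        rows.append({"irq": irq.strip(), "counts": counts,
                     "type": kind, "device": device})
    return rows


for row in parse_interrupts(SAMPLE):
    print(row["irq"], row["type"], row["device"], sum(row["counts"]))
```

On a real system the same function could be fed the contents of /proc/interrupts, keeping in mind that summary rows such as NMI or ERR do not follow the four-part layout.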

Everything here is as it should be in a modern system. Devices and drivers that support MSI/MSI-X use that interrupt type. The rest of the interrupt routing is done through the APIC controller.

Simplistically, the interrupt routing schematics can be drawn like this: (red lines are active routing paths and black lines are unused routing paths)

A device that supports MSI/MSI-X interrupts should have that particular capability listed in its PCI configuration space.

As an example, let’s look at a small fragment of the lspci output for the devices that declare they use MSI/MSI-X. In our case these are a SATA controller (interrupt ‘ahci’), two Ethernet controllers (interrupts ‘eth58*’ and ‘eth59*’), a graphics controller (‘i915’), and two HD Audio controllers (‘snd_hda_intel’).

As we see, all of these devices either have a string «MSI: Enable+» or «MSI-X: Enable+».
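The same check can be automated by scanning lspci-style text for the «MSI: Enable+» and «MSI-X: Enable+» strings. The Python sketch below works on an invented sample. The device names and capability offsets in SAMPLE are placeholders, not the output of the board above, and enabled_msi is a hypothetical helper name.

```python
import re

# Invented lspci-like fragment: device lines start in column 0,
# capability lines are indented.
SAMPLE = """\
00:1f.2 SATA controller: Example AHCI controller
\tCapabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
01:00.0 Ethernet controller: Example NIC
\tCapabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
\tCapabilities: [70] MSI-X: Enable+ Count=10 Masked-
"""

CAP_RE = re.compile(r"\b(MSI(?:-X)?): Enable([+-])")


def enabled_msi(text: str) -> dict:
    """Map each device address to the MSI capabilities marked Enable+."""
    result = {}
    device = None
    for line in text.splitlines():
        if line and not line[0].isspace():
            device = line.split()[0]     # e.g. "00:1f.2"
            result[device] = []
        elif device is not None:
            match = CAP_RE.search(line)
            if match and match.group(2) == "+":
                result[device].append(match.group(1))
    return result


print(enabled_msi(SAMPLE))
```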

Let’s downgrade our system! For a start let’s boot with the kernel option ‘pci=nomsi’.

pci=nomsi

Because of this option MSI interrupts become IO-APIC/XT-PIC depending on the interrupt controller in use.

In this case the preferred choice is still the modern APIC controller, so the interrupt picture will be:

Output of /proc/interrupts:

As expected, all MSI/MSI-X interrupts have disappeared. In their place, devices now use interrupts of the ‘IO-APIC-fasteoi’ type.

Note that before this kernel boot option was enabled, ‘eth58’ and ‘eth59’ each had nine interrupts, whereas now each of them has only one. Recall that without MSI, one function of a PCI device can have only one interrupt!

Here is a little info from the ‘dmesg’ command about the ethernet controllers’ initialization:

— boot without the ‘pci=nomsi’ option:

— boot with the ‘pci=nomsi’ option:

Because each device now has fewer interrupts, enabling this option can significantly limit device driver performance. On top of that, according to the Intel research ‘Reducing Interrupt Latency Through the Use of Message Signaled Interrupts’, MSI interrupts are 3 times faster than IO-APIC interrupts and 5 times faster than PIC interrupts.

noapic

This option disables I/O APIC. MSI interrupts can still find their way to all of the CPUs, but the rest of interrupts from the devices can go only to CPU0, because PIC is only connected to CPU0. However, LAPIC is working and all other CPUs can still work and handle interrupts.

As we see, all IO-APIC-* interrupts have turned into XT-PIC-XT-PIC, and all of these interrupts have been routed to CPU0 only. MSI interrupts on the other hand have remained unchanged and go to all of the CPUs.

nolapic

This kernel boot option disables LAPIC. MSI interrupts can’t work without LAPIC, and I/O APIC can’t work without LAPIC either. All of the device interrupts can only go to the PIC, and it works with the CPU0 only. And without LAPIC the rest of the CPUs besides CPU0 won’t work.

Output of /proc/interrupts:

Combinations of options:

Actually there is only one combination for the new variant of routing: «noapic pci=nomsi». In this case all interrupts from the devices only go to the CPU0 through the PIC controller. But the LAPIC system is still working, so all the other CPUs can work and handle interrupts.

There is no point in combining any other option with «nolapic», since it already makes the I/O APIC and MSI inaccessible. Therefore, if you’ve ever added Linux kernel boot options like «noapic nolapic» (or the most common case, «acpi=off noapic nolapic»), you have typed some redundant characters.
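The effect of these three options, as described in the sections above, can be summarized in a toy model. The Python sketch below only encodes the article's statements. It does not query a real kernel, and the function name interrupt_model and the dictionary keys are invented for this illustration.

```python
def interrupt_model(options) -> dict:
    """Model which interrupt paths remain for a set of boot options."""
    opts = set(options)
    lapic = "nolapic" not in opts
    # Neither the I/O APIC nor MSI can work without the LAPIC.
    ioapic = lapic and "noapic" not in opts
    msi = lapic and "pci=nomsi" not in opts
    return {
        "lapic": lapic,
        "ioapic": ioapic,
        "msi": msi,
        # Controller used for the non-MSI device interrupt lines.
        "line_controller": "IO-APIC" if ioapic else "XT-PIC",
        # Without the LAPIC only CPU0 works.
        "other_cpus_usable": lapic,
    }


for combo in ([], ["pci=nomsi"], ["noapic"], ["noapic", "pci=nomsi"],
              ["nolapic"], ["noapic", "nolapic"]):
    print(" ".join(combo) or "(no options)", "->", interrupt_model(combo))
```

In this model «noapic nolapic» yields exactly the same result as «nolapic» alone, which is the redundancy pointed out above.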

Finally, here is the result of the options «noapic pci=nomsi» to our interrupt routing picture:

And the output of /proc/interrupts is:

Interrupt routing tables and the options «acpi=noirq», «pci=noacpi», «acpi=off»

How does the operating system get information about the device interrupt routing? The BIOS prepares such info for the OS in the form of:

ACPI tables (_PIC/_PRT functions)
_MP_ table (MPtable)
$PIR table
Registers 0x3C/0x3D of the device’s PCI configuration space

It is worth noting that for MSI interrupts the BIOS doesn’t need to do anything extra (besides declaring the use of the LAPIC): all the aforementioned routing information is needed only for the APIC/PIC interrupt lines.

Tables in the list above are presented in the order of priority. Let’s examine it in detail.

Let’s assume the BIOS has presented all this data and we boot our OS without any extra boot options:

OS finds ACPI tables.
OS executes ACPI function «_PIC», passing it the argument stating that the boot should happen in APIC mode. Here there is function code that usually saves the chosen mode in a variable (for example, PICM=1).
To access interrupt routing info the OS calls ACPI function «_PRT». This checks the PICM variable and returns routing for the APIC mode case.

In the case when we boot with the option noapic:

OS finds ACPI tables
OS executes ACPI function «_PIC», passing it the argument stating that the boot should happen in PIC mode. Here there is function code that usually saves the chosen mode in a variable (for example, PICM=0)
To access interrupt routing info the OS calls ACPI function «_PRT». This checks the PICM variable and returns routing for the PIC mode case.

If ACPI tables aren’t present or interrupt routing with ACPI is disabled through the option acpi=noirq or pci=noacpi (or ACPI subsystem is completely disabled with the acpi=off option), then the OS looks for the MPtable (_MP_) to get all the interrupt routing information:

OS can’t find/doesn’t look at the ACPI tables
OS finds MPtable (_MP_)

If ACPI tables aren’t present or interrupt routing with ACPI is disabled through the option acpi=noirq or pci=noacpi (or ACPI subsystem is completely disabled with the acpi=off option), and if the MPtable (_MP_) is not present either (or there is a boot option noapic or nolapic):

OS can’t find/doesn’t look at the ACPI tables
OS can’t find/doesn’t look at the MPtable (_MP_)
OS finds $PIR table

If there is no $PIR table or it is not full, then the OS will look at the registers 0x3C/0x3D of the device’s PCI configuration space to guess interrupt routing.
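The lookup order described above can be written down as a short decision function. The Python sketch below is a simplified model of the article's description, not of the kernel's actual code. The function name routing_source, its parameters and the returned labels are invented for this illustration, and treating «nolapic» like «noapic» in the ACPI branch is an assumption.

```python
ACPI_IRQ_OFF = {"acpi=noirq", "pci=noacpi", "acpi=off"}


def routing_source(options, have_acpi=True, have_mptable=True,
                   have_pir=True) -> str:
    """Pick the interrupt routing source in the priority order above."""
    opts = set(options)
    pic_mode = bool(opts & {"noapic", "nolapic"})
    use_acpi = have_acpi and not (opts & ACPI_IRQ_OFF)
    if use_acpi:
        # _PIC is told which mode to use, _PRT returns matching routing.
        return "ACPI (PIC mode)" if pic_mode else "ACPI (APIC mode)"
    if have_mptable and not pic_mode:
        return "MPtable"
    if have_pir:
        return "$PIR"
    return "PCI config registers 0x3C/0x3D"


print(routing_source([]))
print(routing_source(["noapic"]))
print(routing_source(["acpi=noirq"]))
print(routing_source(["acpi=off", "noapic"]))
```

The boolean parameters stand for "the BIOS provides this table", which makes it easy to try the cases where a table is missing.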

Here is a picture summarizing all of this:

One should remember that not every BIOS provides all of these three tables (ACPI/MPtable/$PIR), so if you’ve passed an option to your bootloader (e.g. GRUB) that disables the use of ACPI or ACPI and MPtable for the interrupt routing, it is possible that your system won’t boot.

Note 1: In the case when we try to boot in APIC mode with the option ‘acpi=noirq’ and without MPtable present, the picture of interrupts will be like in the case of normal booting with only the ‘noapic’ option. The operating system will go to PIC mode by itself. In the case when you try to boot without any ACPI tables at all (‘acpi=off’) and without MPtable present, then the picture will be like this:

This happens because without the ACPI MADT table (Multiple APIC Description Table) and the necessary info from the MPtable, the operating system doesn’t know APIC identifiers (APIC IDs) for the other CPUs and can’t work with them. But the LAPIC of the main CPU0 works because we haven’t disabled it, and MSI interrupts can still go to it. So the interrupt picture would be:

Note 2: In general, interrupt routing with the use of ACPI in an APIC case should match the interrupt routing with the MPtable. Also, the interrupt routing with the use of ACPI in a PIC case should match the interrupt routing with the $PIR table. Therefore the ‘/proc/interrupts’ output should not differ. But in my investigation I’ve noticed one strange fact. For some reason in the case of interrupt routing through the MPtable there is a cascade interrupt «XT-PIC-XT-PIC cascade» in the output:

It is a little bit strange that it happens like that, but it seems like the kernel source documentation says that it is OK.

Conclusion

To conclude, let us review the discussed options once more.

Interrupt controller choice options:

pci=nomsi — MSI interrupts become IO-APIC/XT-PIC depending on the interrupt controller in use.
noapic — Disables I/O APIC. MSI interrupts can still go to all the other CPUs, the rest of the device interrupts can only go to the PIC, and it works with the CPU0 only. But LAPIC still works and other CPUs can work and handle interrupts.
noapic pci=nomsi — All of the device interrupts can only go to the PIC, and it works with the CPU0 only. But LAPIC works and other CPUs can work and handle interrupts.
nolapic — Disables LAPIC. MSI interrupts can’t work without LAPIC, and I/O APIC can’t work without LAPIC. All of the device interrupts can only go to the PIC, and it works with the CPU0 only. And without LAPIC the rest of the CPUs besides CPU0 won’t work.

Interrupt tables priority options:

no options — routing through the APIC with the help of ACPI tables
noapic — routing through the PIC with the help of ACPI tables
acpi=noirq (pci=noacpi/acpi=off) — routing through the APIC with the help of MPtable
acpi=noirq (pci=noacpi/acpi=off) noapic (nolapic) — routing through the PIC with the help of $PIR

In the next part we will look at how coreboot configures the chipset for the interrupt routing.

Acknowledgments

Special thanks to Jacob Garber from the coreboot community for helping me with the translation of this article.

Source