Содержание

Алгоритмы синхронизации
Алгоритм Петерсона
Алгоритм булочной (Bakery algorithm)
std:: atomic_flag::test_and_set
Parameters
Return value
Example
The Infinite Loop
Tales from a lean programmer.
Test-and-set spinlocks
Introduction
The test-and-test-and-set optimization
Exponential back-off
Pausing and sleeping
Performance comparison
Test-and-set
Смотреть что такое «Test-and-set» в других словарях:

Алгоритмы синхронизации

Алгоритм Петерсона

Первое решение проблемы, удовлетворяющее всем требованиям и использующее идеи ранее рассмотренных алгоритмов, было предложено датским математиком Деккером (Dekker). В 1981 году Петерсон (Peterson) предложил более изящное решение. Пусть оба процесса имеют доступ к массиву флагов готовности и к переменной очередности.

При исполнении пролога критической секции процесс P_i заявляет о своей готовности выполнить критический участок и одновременно предлагает другому процессу приступить к его выполнению. Если оба процесса подошли к прологу практически одновременно, то они оба объявят о своей готовности и предложат выполняться друг другу. При этом одно из предложений всегда следует после другого. Тем самым работу в критическом участке продолжит процесс, которому было сделано последнее предложение.

Давайте докажем, что все пять наших требований к алгоритму действительно удовлетворяются.

Удовлетворение требований 1 и 2 очевидно.

Докажем выполнение условия взаимоисключения методом от противного. Пусть оба процесса одновременно оказались внутри своих критических секций . Заметим, что процесс P_i может войти в критическую секцию , только если ready[1-i] == 0 или turn == i . Заметим также, что если оба процесса выполняют свои критические секции одновременно, то значения флагов готовности для обоих процессов совпадают и равны 1 . Могли ли оба процесса войти в критические секции из состояния, когда они оба одновременно находились в процессе выполнения цикла while ? Нет, так как в этом случае переменная turn должна была бы одновременно иметь значения 0 и 1 (когда оба процесса выполняют цикл, значения переменных измениться не могут). Пусть процесс P₀ первым вошел в критический участок , тогда процесс P₁ должен был выполнить перед вхождением в цикл while по крайней мере один предваряющий оператор (turn = 0;) . Однако после этого он не может выйти из цикла до окончания критического участка процесса P₀ , так как при входе в цикл ready[0] == 1 и turn == 0 , и эти значения не могут измениться до тех пор, пока процесс P₀ не покинет свой критический участок . Мы пришли к противоречию. Следовательно, имеет место взаимоисключение .

Докажем выполнение условия прогресса . Возьмем, без ограничения общности, процесс P₀ . Заметим, что он не может войти в свою критическую секцию только при совместном выполнении условий ready[1] == 1 и turn == 1 . Если процесс P₁ не готов к выполнению критического участка , то ready[1] == 0 , и процесс P₀ может осуществить вход. Если процесс P₁ готов к выполнению критического участка , то ready[1] == 1 и переменная turn имеет значение 0 либо 1 , позволяя процессу P₀ либо процессу P₁ начать выполнение критической секции . Если процесс P₁ завершил выполнение критического участка , то он сбросит свой флаг готовности ready[1] == 0 , разрешая процессу P₀ приступить к выполнению критической работы. Таким образом, условие прогресса выполняется.

Отсюда же вытекает выполнение условия ограниченного ожидания . Так как в процессе ожидания разрешения на вход процесс P₀ не изменяет значения переменных, он сможет начать исполнение своего критического участка после не более чем одного прохода по критической секции процесса P₁ .

Алгоритм булочной (Bakery algorithm)

Алгоритм Петерсона дает нам решение задачи корректной организации взаимодействия двух процессов. Давайте рассмотрим теперь соответствующий алгоритм для n взаимодействующих процессов, который получил название алгоритм булочной , хотя применительно к нашим условиям его следовало бы скорее назвать алгоритм регистратуры в поликлинике. Основная его идея выглядит так. Каждый вновь прибывающий клиент (он же процесс) получает талончик на обслуживание с номером. Клиент с наименьшим номером на талончике обслуживается следующим. К сожалению, из-за неатомарности операции вычисления следующего номера алгоритм булочной не гарантирует, что у всех процессов будут талончики с разными номерами. В случае равенства номеров на талончиках у двух или более клиентов первым обслуживается клиент с меньшим значением имени (имена можно сравнивать в лексикографическом порядке). Разделяемые структуры данных для алгоритма – это два массива

Изначально элементы этих массивов инициируются значениями false и 0 соответственно. Введем следующие обозначения

std:: atomic_flag::test_and_set

Sets the atomic_flag and returns whether it was already set immediately before the call.

The entire operation is atomic (an atomic read-modify-write operation): the value is not affected by other threads between the instant its value is read (to be returned) and the moment it is modified by this function.

Parameters

sync Synchronization mode for the operation.
This can be any of the possible values of the enum type memory_order :

value	memory order	description
memory_order_relaxed	Relaxed	No synchronization of side effects.
memory_order_consume	Consume	Synchronizes the visible side effects on values carrying dependencies from the last release or sequentially consistent operation.
memory_order_acquire	Acquire	Synchronizes all visible side effects from the last release or sequentially consistent operation.
memory_order_release	Release	Synchronizes side effects with the next consume or acquire operation.
memory_order_acq_rel	Acquire/Release	Reads as an acquire operation and writes as a release operation (as described above).
memory_order_seq_cst	Sequentially consistent	Synchronizes all visible side effects with the other sequentially consistent operations, following a single total order.

Return value

Example

Possible output (order of lines may vary):

The Infinite Loop

Tales from a lean programmer.

Test-and-set spinlocks

In my previous blog post I wrote about the most important properties of spinlocks and why they matter. This time I’ll present the first out of three concrete spinlock implementations in C++11 for x86 processors. I’m assuming that you’ve read the first post. You can find all source code in this Github repository.

Introduction

The Test-And-Set (TAS) Lock is the simplest possible spinlock implementation. It uses a single shared memory location for synchronization, which indicates if the lock is taken or not. The memory location is updated using a test-and-set ( TAS ) operation. TAS atomically writes to the memory location and returns its old value in a single indivisible step. Actually, the name test-and-set is a little misleading, because the caller is responsible for testing if the operation has succeeded or not. The TAS Lock is not fair as it doesn’t guarantee FIFO ordering amongst the threads competing for the lock.

In C++11 std::atomic_bool::exchange() can be used to perform a TAS operation on an std::atomic_bool synchronization variable. On x86 CPUs std::atomic::exchange() is turned into the LOCK XCHG instruction (side note: the LOCK prefix is implicit and not strictly required, because there’s no non-atomic version of XCHG ). The following code implements the described TAS Lock.

The TasSpinLock::Locked flag is cache line size padded using alignas to prevent false sharing. You can remove the padding, e.g. in memory-limited environments, when false sharing is not an issue.

The test-and-test-and-set optimization

While the TAS Lock is extremely easy to implement, its scalability is very bad. Already with just a few threads competing for the lock, the amount of required cache line invalidations to acquire/release the lock quickly degrades performance. The problem is two-fold.

std::atomic_bool::exchange() always invalidates the cache line Locked resides in, regardless of whether it succeeded in updating Locked or not.
When the spinlock is released, all waiting threads simultaneously try to acquire it. This is sometimes referred to as Thundering Herd problem. Acquiring the lock means first invalidating the cache line copy of all threads waiting for the lock and then reading the valid cache line copy from the core which released the lock. With t threads this results in O(t) memory bus transactions.

The latter issue is inherent to TAS Locks, because just a single synchronization variable is used and shared between all threads. However, the first issue can be fixed by testing if the lock is free before calling exchange() . First testing if the lock is free only loads the synchronization variable and therefore doesn’t cause a cache line invalidation. This spinlock variant is called Test-And-Test-And-Set (TTAS) Lock. The improved Enter() method looks like this:

Exponential back-off

The TTAS Lock successfully reduces the amount of cache line invalidations occurring while the lock is trying to be acquired. However, as soon as the lock is released, again all threads try to update the Locked flag. If a thread sees that the lock is free, but fails acquiring it subsequently because some other thread was faster, the lock is most probably under high contention. In this situation it can help to wait for a short time to let other threads finish before trying to enter the critical section (CS) again.

Waiting a short time without trying to acquire the lock reduces the number of threads simultaneously competing for the lock. The larger the number of unsuccessful retries, the higher the lock contention and the longer the thread should back-off. Experiments have shown that a good strategy is to back-off exponentially; similar to the congestion avoidance mechanism used in the Ethernet protocol. To avoid threads running in lock step the initial wait time is randomized.

The downside of waiting is that it renders the lock even unfairer than it is already. Threads that have been attempting to acquire the lock longest are also backing-off longest. Consequently, newly arriving threads have a higher chance to acquire the lock than older threads. On top of that threads might back-off for too long, causing the CS to be underutilized. Furthermore, there are unfortunately no single best values for MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS . Optimally, they’re determined experimentally depending on the workload (contention, size of CS).

Pausing and sleeping

There are two more improvements we can do. First, according to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. PAUSE is backwards compatible and turns into a NOP instruction on older CPUs. There are three reasons why using PAUSE improves performance:

It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn reduces power consumption considerably.

The second improvement is to stop spinning, put the thread to sleep and reschedule if the the lock doesn’t become available after some time. The important bit here is to not yield ( sched_yield() on Linux, SwitchToThread() on Windows, or more generally std::this_thread::yield() in C++11), but to explicitly put the thread to sleep for a short period of time. This ensures that the thread is really paused for some time and isn’t immediately run again by the scheduler if there’s no other thread available in the scheduler’s run queue. This is especially important on SMT processors where every two logical cores share most of the execution units. Only when the thread is really sleeping the other logical core can fully make use of the shared execution units. The following TTAS Lock implementation uses the PAUSE instruction and sleeps for 500us if the lock isn’t released within MAX_WAIT_ITERS iterations. There’s no single best value for MAX_WAIT_ITERS . Ideally, it’s chosen experimentally like the MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS constants from before.

Performance comparison

The benchmark used to compare the different TAS Lock variants is extremely simple and should be taken with a grain of salt. It launches t threads, each competing for the lock and incrementing a global atomic counter inside the CS. This causes high lock contention, because all threads are trying to enter the lock at the same time. Though, it’s important to keep in mind that due to the lock’s lack of fairness, it’s entirely possible that the same thread acquires the lock multiple times in a row. I ran the benchmark on a dual socket NUMA system with two Intel Xeon E5-2630 v2 CPUs with 2.60 GHz. The CPUs support Hyper Threading which gives a total of 12 physical and 24 logical cores. The benchmark results are shown below.

The most interesting finding is that the amount of lock reacquisitions is a lot higher than I anticipated. This can be derived from the fact that all TTAS-based locks scale almost linearly, where typically the observed drop in performance for TAS -based locks is exponential. Large numbers of lock reacquisitions by the same thread reduces the amount of cache line invalidations and increases throughput. At the same time the latency of all other threads goes up, because they’re granted the lock later.

The TasSpinLock scales so much worse, even though lock reacquisiton happens, because while spinning the TAS operation causes cache line invalidations also if it doesn’t succeed in acquiring the lock. By looking at the difference between the TasSpinLock and the other TTAS-based versions, it’s highly visible how big the influence of cache line invalidations is on the lock’s performance.

Test-and-set

Test-and-set — простая неразрывная (атомарная) процессорная инструкция, которая копирует значение переменной в регистр, и устанавливает некое новое значение. Во время исполнения данной инструкции процессор не может прервать её выполнение и переключится на выполнение другого потока. Если используется многопроцессорная архитектура, то пока один процессор выполняет эту инструкцию с ячейкой памяти, то другие процессоры не могут получить доступ к этой ячейке, может достигаться путем кратковременного блокирования шины памяти.

При этом разблокирование ячейки производится обычной процедурой MOV:

Это заготовка статьи о компьютерах. Вы можете помочь проекту, исправив и дополнив её.
Это примечание по возможности следует заменить более точным.

Wikimedia Foundation . 2010 .

Смотреть что такое «Test-and-set» в других словарях:

Test-and-set — In computer science, the test and set instruction is an instruction used to both test and (conditionally) write to a memory location as part of a single atomic (i.e. non interruptible) operation. This means setting a value, but first performing… … Wikipedia

Test and Test-and-set — In computer science, the test and set CPU instruction is used to implement mutual exclusion in multiprocessor environments. Although a correct lock can be implemented with test and set, it can lead to memory contention in busy lock (caused by bus … Wikipedia

Test-and-set lock — TSL son las siglas de Test and set lock, una instrucción hardware utilizada por ciertos procesadores para facilitar la creación de semáforos y otras herramientas necesarias para la programación concurrente en computadores. TSL es una instrucción… … Wikipedia Español

Test and tagging — is a generic name given to the process of visually inspecting and electrically testing in service electrical equipment for personal use and/or safety. Colloquially, it is also referred to as; tagging, test tag, test and tag, electrical tagging,… … Wikipedia

Test and learn — is a set of practices followed by retailers and consumer focused companies to test ideas in a small number of locations or customers to predict impact. The process is often designed to answer three questions about any tested program before… … Wikipedia

Test (Unix) — test is a Unix command that evaluates conditional expressions.yntax test expression or [ expression ] DescriptionThe test command evaluates the expression parameter. In the second form of the command, the [ ] (brackets) must be surrounded by… … Wikipedia

Test fixture — refers to the fixed state used as a baseline for running tests in software testing. The purpose of a test fixture is to ensure that there is a well known and fixed environment in which tests are run so that results are repeatable. Some people… … Wikipedia

Test-driven development — (TDD ) is a software development technique consisting of short iterations where new test cases covering the desired improvement or new functionality are written first, then the production code necessary to pass the tests is implemented, and… … Wikipedia

Test light — Neon test lamp for line voltages A test light, test lamp, voltage tester, or mains tester is a very simple piece of electronic test equipment used to determine the presence or absence of an electric voltage in a piece of equipment under test.… … Wikipedia

test — The event of a price movement that approaches a support level or a resistance level established earlier by the market. A test is passed if prices do not go below the support or resistance level, and the test is failed if prices go on to new lows… … Financial and business terms

Test and set windows