Thread CPU time on Windows

GetThreadTimes function (processthreadsapi.h)

Retrieves timing information for the specified thread.

Syntax
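
The function is declared in processthreadsapi.h as follows:

    BOOL GetThreadTimes(
      [in]  HANDLE     hThread,
      [out] LPFILETIME lpCreationTime,
      [out] LPFILETIME lpExitTime,
      [out] LPFILETIME lpKernelTime,
      [out] LPFILETIME lpUserTime
    );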

Parameters

hThread: A handle to the thread whose timing information is sought. The handle must have the THREAD_QUERY_INFORMATION or THREAD_QUERY_LIMITED_INFORMATION access right. For more information, see Thread Security and Access Rights.

Windows Server 2003 and Windows XP: The handle must have the THREAD_QUERY_INFORMATION access right.

lpCreationTime: A pointer to a FILETIME structure that receives the creation time of the thread.

lpExitTime: A pointer to a FILETIME structure that receives the exit time of the thread. If the thread has not exited, the content of this structure is undefined.

lpKernelTime: A pointer to a FILETIME structure that receives the amount of time that the thread has executed in kernel mode.

lpUserTime: A pointer to a FILETIME structure that receives the amount of time that the thread has executed in user mode.

Return value

If the function succeeds, the return value is nonzero.

If the function fails, the return value is zero. To get extended error information, call GetLastError.

Remarks

All times are expressed using FILETIME data structures. Such a structure contains two 32-bit values that combine to form a 64-bit count of 100-nanosecond time units.

Thread creation and exit times are points in time expressed as the amount of time that has elapsed since midnight on January 1, 1601 at Greenwich, England. There are several functions that an application can use to convert such values to more generally useful forms; see Time Functions.

Thread kernel mode and user mode times are amounts of time. For example, if a thread has spent one second in kernel mode, this function will fill the FILETIME structure specified by lpKernelTime with a 64-bit value of ten million. That is the number of 100-nanosecond units in one second.
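
As a minimal sketch of that conversion in C++ (the helper name is illustrative, not part of the API), the two 32-bit halves can be combined through a ULARGE_INTEGER:

    #include <windows.h>

    // Combine the two 32-bit halves of a FILETIME into a 64-bit tick count,
    // then convert to seconds: one second is 10,000,000 units of 100 ns.
    double FileTimeToSeconds(const FILETIME& ft) {
        ULARGE_INTEGER ticks;
        ticks.LowPart = ft.dwLowDateTime;
        ticks.HighPart = ft.dwHighDateTime;
        return ticks.QuadPart / 10000000.0;
    }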

To retrieve the number of CPU clock cycles used by the threads, use the QueryThreadCycleTime function.

How to get the CPU usage per thread on Windows (Win32)

I'm looking for Win32 API functions, or C++ or Delphi sample code, that will tell me the CPU usage (percent and/or total CPU time) of a thread (not the total for a process). I have the thread ID.

I know that Sysinternals Process Explorer can display this information, but I need this information inside my program.

7 Answers

Use these functions to get the CPU usage per thread and per process:

GetThreadTimes (Retrieves timing information for the specified thread.)

GetProcessTimes (Retrieves timing information for the specified process.)

GetSystemTime (Retrieves the current system date and time. The system time is expressed in Coordinated Universal Time (UTC).)

Here is an excellent article from Dr. Dobb's: Win32 Performance Measurement Options.

The data you are referring to is available using specific WMI calls. You can query Win32_Process to get all sorts of process-specific information, and query Win32_PerfFormattedData_PerfProc_Process to get the thread count; given a handle to a thread (which I believe is what you're looking for), you can query Win32_PerfRawData_PerfProc_Thread to get the percent of processor time used.

There is a library available for Delphi which provides wrappers for most of the WMI queries, although it will take some experimentation to get the exact query you're looking for. The query syntax is very SQL-like; for example, on my system, the query to return the percent of processor time for thread ID 8 in process ID 4 is:
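
The query itself did not survive in this copy; as a hedged sketch, such a query issued from C++ through the WMI COM API might look like the following (the thread and process IDs are illustrative, and error handling is omitted for brevity):

    #include <windows.h>
    #include <wbemidl.h>
    #include <comdef.h>
    #include <cstdio>

    #pragma comment(lib, "wbemuuid.lib")

    int main() {
        // Initialize COM and default security for WMI access.
        CoInitializeEx(nullptr, COINIT_MULTITHREADED);
        CoInitializeSecurity(nullptr, -1, nullptr, nullptr,
                             RPC_C_AUTHN_LEVEL_DEFAULT,
                             RPC_C_IMP_LEVEL_IMPERSONATE,
                             nullptr, EOAC_NONE, nullptr);

        // Connect to the root\cimv2 namespace.
        IWbemLocator* locator = nullptr;
        CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
                         IID_IWbemLocator, reinterpret_cast<void**>(&locator));
        IWbemServices* services = nullptr;
        locator->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), nullptr, nullptr,
                               nullptr, 0, nullptr, nullptr, &services);

        // Illustrative WQL: raw processor-time counter for thread 8 of process 4.
        IEnumWbemClassObject* rows = nullptr;
        services->ExecQuery(
            _bstr_t(L"WQL"),
            _bstr_t(L"SELECT PercentProcessorTime "
                    L"FROM Win32_PerfRawData_PerfProc_Thread "
                    L"WHERE IDProcess=4 AND IDThread=8"),
            WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
            nullptr, &rows);

        IWbemClassObject* row = nullptr;
        ULONG returned = 0;
        while (rows && SUCCEEDED(rows->Next(WBEM_INFINITE, 1, &row, &returned))
               && returned != 0) {
            VARIANT value;
            VariantInit(&value);
            row->Get(L"PercentProcessorTime", 0, &value, nullptr, nullptr);
            // Raw 64-bit counters are marshaled as BSTR; two samples over time
            // are needed to turn this raw value into an actual percentage.
            wprintf(L"PercentProcessorTime (raw): %s\n", V_BSTR(&value));
            VariantClear(&value);
            row->Release();
        }

        if (rows) rows->Release();
        if (services) services->Release();
        if (locator) locator->Release();
        CoUninitialize();
        return 0;
    }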

Most of the programs which present statistical information about running processes now use WMI to query for this information.

It is important to know that in certain situations the execution time reported for a thread may be worthless. The execution times of each thread are typically updated every 15 milliseconds on multi-core systems, so if a thread completes its task before this interval elapses, its reported runtime may be zero. More details can be found at these links: GetThreadTimes function and I was surprised by the result! and Why GetThreadTimes is wrong.

With the help of RRUZ’s answer above I finally came up with this code for Borland Delphi:

Using GetThreadTimes? If you measure the time between calls to GetThreadTimes and store the previous user and/or kernel times, then you know how much CPU time the thread has received since you last checked. You also know how much wall-clock time has elapsed in the meantime, so you can work out how much CPU time was used. It is best (for timer-resolution reasons) to make this check every second or so and work out the average CPU usage over that second.
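
A sketch of that approach in C++ (the function names and sampling interval are illustrative; GetTickCount64 supplies the wall-clock side):

    #include <windows.h>

    // Combine a FILETIME into a 64-bit count of 100-nanosecond units.
    static ULONGLONG ToTicks(const FILETIME& ft) {
        ULARGE_INTEGER u;
        u.LowPart = ft.dwLowDateTime;
        u.HighPart = ft.dwHighDateTime;
        return u.QuadPart;
    }

    // Approximate a thread's CPU usage (0.0 to 1.0) over roughly intervalMs
    // by differencing two GetThreadTimes snapshots against wall-clock time.
    double SampleThreadCpuUsage(HANDLE hThread, DWORD intervalMs) {
        FILETIME creation, exitTime, kernelBefore, userBefore, kernelAfter, userAfter;
        ULONGLONG wallBefore = GetTickCount64();
        if (!GetThreadTimes(hThread, &creation, &exitTime, &kernelBefore, &userBefore))
            return -1.0;  // Call GetLastError() for details.

        Sleep(intervalMs);

        ULONGLONG wallAfter = GetTickCount64();
        if (!GetThreadTimes(hThread, &creation, &exitTime, &kernelAfter, &userAfter))
            return -1.0;

        // CPU time consumed, converted from 100 ns units to milliseconds.
        ULONGLONG cpuMs = (ToTicks(kernelAfter) - ToTicks(kernelBefore)
                         + ToTicks(userAfter) - ToTicks(userBefore)) / 10000;
        ULONGLONG wallMs = wallAfter - wallBefore;
        return wallMs ? static_cast<double>(cpuMs) / wallMs : 0.0;
    }

Given a thread ID, as in the question, a handle with the required access right can be obtained with OpenThread(THREAD_QUERY_INFORMATION, FALSE, threadId) and should be closed with CloseHandle afterwards.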

Here is a simple WMI query wrapper. With its help, you can call this to get the data:

Also you might want to look at the Win32_PerfRawData_PerfProc_Thread documentation to see what other properties you can fetch.

Efficient way of getting thread CPU time using JMX

I’m currently getting the total thread CPU time using JMX in the following manner:

As the ThreadMXBean is actually a remote proxy, performance is dreadful: on the order of seconds for this method call.

Is there a faster way of doing this?

Update: I’m using this for performance monitoring. The measurements were both ‘wall clock’ time and JProfiler, which showed about 85% of my time spent in this method. I do have some other MXBean calls (Runtime, Memory, GC), but they are much cheaper, most likely because every call to thread.getThreadCpuTime is a remote one.

Update 2: JProfiler screenshot showing the performance problems.

5 Answers

If you are willing to use non-standard APIs, you can cast OperatingSystemMXBean to com.sun.management.OperatingSystemMXBean and invoke getProcessCpuTime(), as described in Using a Sun internal class to get JVM CPU time on David Robert Nadeau’s blog.

It must be the remote nature of your calls that is the problem.

I recently did a spike where I asked a thread for its CPU time in-process, and it was incredibly fast. In the context of a web application request it was nearly immeasurable. Calling it in a hard loop would cost you, but typically you want it at the start of an operation and again at the end.

  • invoke getThreadCpuTime inside a thread pool, since the call seems to be network-bound;
  • whenever a thread is found to be in Thread.State.TERMINATED, keep its name in a Map and skip querying it the next time.

The cause of the poor performance is that each call of thread.getThreadCpuTime() will (in your case of a remote proxy) most probably result in an RMI/remote JMX call, which is of course expensive, especially when the message is actually sent via TCP (and maybe even beyond localhost). (See the JMX 1.4 spec, chapter 7.3.)

Although JMX defines multiple-attribute retrieval in a single call, that will not help you here, because thread.getThreadCpuTime(long) is not a JMX attribute but a remote method invocation (and additionally, the ThreadInfo object on which the method is invoked differs for each call).

Unfortunately, the JMX 1.4 spec does not define anything like "invoke multiple methods on multiple objects and return all their return values at once". So besides concurrent invocation of these calls (which will save wall-clock time but will not reduce the overall effort), I am not aware of any (compatible) optimizations here.

OS thread scheduling and CPU usage relations

As far as I know, for thread scheduling, Linux implements a fair scheduler and Windows implements a round-robin (RR) scheduler: each thread has a time slice for its execution (correct me if I’m wrong).

I wonder, is the CPU usage related to the thread scheduling?

For example: there are two threads executing at the same time, and the system time slice is 15 ms. The CPU has only one core.

Thread A needs 10 ms to finish its job and then sleeps for 5 ms, in a loop.

Thread B needs 5 ms to finish its job and then sleeps for 10 ms, also in a loop.

Will the CPU usage be 100%?

How is the thread scheduled? Will thread A use up all its time and then schedule out?

One more scenario: suppose thread A is running and is then blocked by some condition (e.g., network I/O). Will 100% CPU usage affect the wakeup time of this thread? For example, thread B may be running in this time window; will it be preempted by the OS so that thread A can resume?

2 Answers

So the CPU usage will be 100%?

Ideally speaking, the answer would be yes, and by "ideally" I mean that you are not considering the time wasted performing context switches. Practically, CPU utilization is increased by keeping the CPU busy all of the time, but some amount of time is still wasted in context switches (the time it takes to switch from one process or thread to another).

But I would say that in your case the time constraints of both threads align perfectly to give maximum CPU utilization.

And how is the thread scheduled? Will thread A use up all its time and then schedule out?

Well, it really depends. In most modern operating system implementations, if there is another process in the ready queue, the current process is scheduled out as soon as it is done with the CPU, regardless of whether it still has time quantum left. So yes, if you are considering a modern OS design, thread A is scheduled out right after its 10 ms.

As I know, Linux implements a fair scheduler and Windows implements the round-robin (RR) scheduler for thread scheduling,

Both Linux and Windows use priority-based, preemptive thread schedulers. Fairness matters, but it’s not, strictly speaking, the objective. Exactly how these schedulers work depends on the version and the flavor (client vs. server) of the system. Generally, thread schedulers are designed to maximize responsiveness and alleviate scheduling hazards such as inversion and starvation. Although some scheduling decisions are made in a round-robin fashion, there are situations in which the scheduler may insert the preempted thread at the front of the queue rather than at the back.

each thread has a time slice for its execution.

The time slice (or quantum) is really more like a guideline than a rule. It’s important to understand that a time slice is divisible and it equals some variable number of clock cycles. The scheduler charges CPU usage in terms of clock cycles, not time slices. A thread may run for more than a time slice (e.g., a time slice and a half). A thread may also voluntarily relinquish the rest of its time slice. This is possible because the only way for a thread to relinquish its time slice is by performing a system call (sleep, yield, acquire lock, request synchronous I/O). All of these are privileged operations that cannot be performed in user-mode (otherwise, a thread can go to sleep without telling the OS!). The scheduler can change the state of the thread from "ready" to "waiting" and schedule some other ready thread to run. If a thread relinquishes the rest of its time slice, it will not be compensated the next time it is scheduled to run.

One particularly interesting case is when a hardware interrupt occurs while a thread is running. In this case, the processor will automatically switch to the interrupt handler, forcibly preempting the thread even if its time slice has not finished yet. In this case, the thread will not be charged for the time it takes to handle the interrupt. Note that the interrupt handler would indeed be utilizing the CPU. By the way, the overhead of context switching itself is also not charged towards any time slice. Moreover, on Windows, the fact that a thread is running in user-mode or kernel-mode by itself does not have an impact on its priority or time slice. On Linux, the scheduler is invoked at specific places in the kernel to avoid starvation (kernel preemption implemented in Linux 2.5+).

So the CPU usage will be 100%? And how is the thread scheduled? Will thread A use up all its time and then schedule out?

It’s easy to answer these questions now. When a thread goes to sleep, the other gets scheduled. Note that this happens even if the threads have different priorities.

If I have a thread running that is then blocked by some condition (e.g., network), will 100% CPU usage affect the wakeup time of this thread? For example, another thread may be running in its time window and not be scheduled out by the OS.

Linux and Windows schedulers implement techniques to enable threads that are waiting on I/O operations to «wake up quickly» and get higher chances of being scheduled soon. For example, on Windows, the priority of a thread waiting on an I/O operation may be boosted a bit when the I/O operation completes. This means that it can preempt another running thread before finishing its time slice, even if both threads had the same priorities initially. When a boosted-priority thread wakes up, its original priority is restored.

CPU Analysis

This guide provides detailed techniques that you can use to investigate central processing unit (CPU)-related issues that impact assessment metrics.

The individual metric or issue sections in the assessment-specific analysis guides identify common problems for investigation. This guide provides techniques and tools that you can use to investigate those problems.

The techniques in this guide use the Windows Performance Analyzer (WPA) from the Windows Performance Toolkit (WPT). The WPT is part of the Windows Assessment and Deployment Kit (Windows ADK) and can be downloaded from the Windows Insider Program. For more information, see Windows Performance Toolkit Technical Reference.

This guide is organized into the following three sections:

This section describes how CPU resources are managed in Windows 10.

This section explains how to view and interpret CPU information in the Windows ADK Toolkit.

This section contains a collection of techniques that you can use to investigate and solve common problems that are related to CPU performance.

Background

This section contains simple descriptions and a basic discussion on CPU performance. For a more comprehensive study on this topic, we recommend the book Windows Internals, Fifth Edition.

Modern computers can contain multiple CPUs that are installed in separate sockets. Each CPU can host multiple physical processor cores, each capable of processing one or two separate instruction streams simultaneously. These individual instruction stream processors are managed by the Windows operating system as logical processors.

In this guide, both processor and CPU refer to a logical processor — that is, a hardware device that the operating system can use to execute program instructions.

Windows 10 actively manages processor hardware in two main ways: power management, to balance power consumption and performance; and usage, to balance the processing requirements of programs and drivers.

Processor Power Management

Processors do not always exist in an operating state. When no instructions are ready to execute, Windows will put a processor into a target idle state (or C-State), as determined by the Windows Power Manager. Based on CPU usage patterns, a processor’s target C-state will be adjusted over time.

Idle states are numbered from C0 (active; not idle) through progressively lower-power states. These states include C1 (halted, but the clock is still enabled), C2 (halted, and the clock is disabled), and so forth. The implementation of idle states is processor-specific. In all processors, however, a higher state number reflects lower power consumption but also a longer wait time before the processor can return to instruction processing. Time that is spent in idle states significantly affects energy use and battery life.

Some processors can operate in performance (P-) and throttle (T-) states even when they are actively processing instructions. P-states define the clock frequencies and voltage levels the processor supports. T-states do not directly change the clock frequency, but can lower the effective clock speed by skipping processing activity on some fraction of clock ticks. Together, the current P- and T- states determine the effective operating frequency of the processor. Lower frequencies correspond to lower performance and lower power consumption.

The Windows Power Manager determines an appropriate P- and T- state for each processor, based on CPU usage patterns and system power policy. Time that is spent in high-performance states versus low-performance states significantly affects energy use and battery life.

Processor Usage Management

Windows uses three major abstractions to manage processor usage.

Deferred Procedure Calls (DPCs) and Interrupt Service Routines (ISRs)

Processes and Threads

All user-mode programs in Windows run in the context of a process. A process includes the following attributes and components:

A virtual address space

Loaded program modules

Environment and configuration information

At least one thread

Although processes contain the program modules, context and environment, they are not directly scheduled to run on a processor. Instead, threads that are owned by a process are scheduled to run on a processor.

A thread maintains execution context information. Almost all computation is managed as part of a thread. Thread activity fundamentally affects measurements and system performance.

Because the number of processors in a system is limited, all threads cannot be run at the same time. Windows implements processor time-sharing, which allows a thread to run for a period of time before the processor switches to another thread. The act of switching between threads is called a context-switch and it is performed by a Windows component called the dispatcher. The dispatcher makes thread scheduling decisions based on priority, ideal processor and affinity, quantum, and state.

Priority

Priority is a key factor in how the dispatcher selects which thread to run. Thread priority is an integer from 0 to 31. If a thread is executable and has a higher priority than a currently running thread, the lower-priority thread is immediately preempted and the higher-priority thread is context-switched in.

When a thread is running or is ready to run, no lower-priority threads can run unless there are enough processors to run both threads at the same time, or unless the higher-priority thread is restricted to run on only a subset of available processors. Threads have a base priority that can be temporarily elevated to higher priorities at certain times: for example, when the process owns the foreground window, or when an I/O completes.

Ideal Processor and Affinity

A thread’s ideal processor and affinity determine the processors on which a given thread is scheduled to run. Each thread has an ideal processor that is set either by the program or automatically by Windows. Windows uses a round-robin methodology so that an approximately equal number of threads in each process are assigned to each processor. When possible, Windows schedules a thread to run on its ideal processor; however, the thread can occasionally run on other processors.

A thread’s processor affinity restricts the processors on which a thread will run. This is a stronger restriction than the thread’s ideal processor attribute. The program sets affinity by using SetThreadAffinityMask. Affinity can prevent threads from ever running on particular processors.
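
For illustration, a minimal sketch that restricts the calling thread to logical processor 0 (the mask choice is arbitrary):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Bit 0 set: allow this thread to run only on logical processor 0.
        DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), 0x1);
        if (previousMask == 0) {
            printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        } else {
            // A nonzero return value is the thread's previous affinity mask.
            printf("Previous affinity mask: 0x%llx\n",
                   static_cast<unsigned long long>(previousMask));
        }
        return 0;
    }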

Quantum

Context switches are expensive operations. Windows generally allows each thread to run for a period of time that is called a quantum before it switches to another thread. Quantum duration is designed to preserve apparent system responsiveness. It maximizes throughput by minimizing the overhead of context switching. Quantum durations can vary between clients and servers. Quantum durations are typically longer on a server to maximize throughput at the expense of apparent responsiveness. On client computers, Windows assigns shorter quantums overall, but provides a longer quantum to the thread associated with the current foreground window.

State

Each thread exists in a particular execution state at any given time. Windows uses three states that are relevant to performance; these are: Running, Ready, and Waiting.

Threads that are currently being executed are in the Running state. Threads that can execute but are currently not running are in the Ready state. Threads that cannot run because they are waiting for a particular event are in the Waiting state.

A state to state transition is shown in Figure 1 Thread State Transitions:

Figure 1 Thread State Transitions

Figure 1 Thread State Transitions is explained as follows:

A thread in the Running state initiates a transition to the Waiting state by calling a wait function such as WaitForSingleObject or Sleep(> 0).

A running thread or kernel operation readies a thread in the Waiting state (for example, SetEvent or timer expiration). If a processor is idle or if the readied thread has a higher priority than a currently running thread, the readied thread can switch directly to the Running state. Otherwise, it is put into the Ready state.

A thread in the Ready state is scheduled for processing by the dispatcher when a running thread waits, yields (Sleep(0)), or reaches the end of its quantum.

A thread in the Running state is switched out and placed into the Ready state by the dispatcher when it is preempted by a higher priority thread, yields (Sleep(0)), or when its quantum ends.

A thread that exists in the Waiting state does not necessarily indicate a performance problem. Most threads spend significant time in the Waiting state, which allows processors to enter idle states and save energy. Thread state becomes an important factor in performance only when a user is waiting for a thread to complete an operation.

DPCs and ISRs

In addition to processing threads, processors respond to notifications from hardware devices such as network cards or timers. When a hardware device requires processor attention, it generates an interrupt. Windows responds to a hardware interrupt by suspending a currently running thread and executing the ISR that is associated with the interrupt.

During the time that it is executing an ISR, a processor can be prevented from handling any other activity, including other interrupts. For this reason, ISRs must complete quickly or system performance can degrade. To decrease execution time, ISRs commonly schedule DPCs to perform work that must be done in response to an interrupt. For each logical processor, Windows maintains a queue of scheduled DPCs. DPCs take priority over threads at any priority level. Before a processor returns to processing threads, it executes all of the DPCs in its queue.

During the time that a processor is executing DPCs and ISRs, no threads can run on that processor. This property can lead to problems for threads that must perform work at a certain throughput or with precise timing, such as a thread that plays audio or video. If the processor time that is used to execute DPCs and ISRs prevents these threads from receiving sufficient processing time, the thread might not achieve its required throughput or complete its work items on time.

Windows ADK Tools

The Windows ADK writes hardware information and assessments to assessment results files. WPA provides detailed information about CPU usage in various graphs. This section explains how to use the Windows ADK and WPA to collect, view, and analyze CPU performance data.

Windows ADK Assessment Results Files

Because Windows supports symmetric multiprocessing systems only, all information in this section applies to all installed CPUs and cores.

Detailed CPU hardware information is available in the EcoSysInfo section of an assessment results file under the <Processor> node.

WPA Graphs

After you load a trace into WPA, you can find processor hardware information under the Trace/System Configuration/General and Trace/System Configuration/PnP sections of the WPA UI.

Note: All procedures in this guide occur in WPA.

CPU Idle States Graph

If idle state information is collected in a trace, the Power/CPU Idle States graph will display in the WPA UI. This graph always contains data on the Target idle state for each processor. The graph will also contain information on each processor’s Actual idle state if this state is supported by the processor.

Each row in the following table describes an idle state change for either the Target or Actual state of a processor. The following columns are available for each row in the graph:

CPU: The processor that is affected by the state change.

Entry Time (s): The time that the processor entered the idle state.

Exit Time (s): The time that the processor exited the idle state.

Max:Duration (ms): The time that is spent in the idle state (default aggregation: maximum).

Min:Duration (ms): The time that is spent in the idle state (default aggregation: minimum).

Next State: The state to which the processor transitioned after the current state.

Prev State: The state from which the processor transitioned before the current state.

State: The current idle state.

State (Numeric): The current idle state as a number (for example, 0 for C0).

Sum:Duration (ms): The time that is spent in the idle state (default aggregation: sum).

Type: Either Target (for the Power Manager's selected target state for the processor) or Actual (for the actual idle state of the processor).

The default WPA profile provides two presets for this graph: State by Type, CPU and State Diagram by Type, CPU.

State by Type, CPU

The Target and Actual states of each CPU are graphed together with the state number on the Y axis in the State by Type, CPU graph. Figure 2 CPU Idle States State by Type, CPU shows the Actual state of the CPU as it fluctuates between Active and Target idle states.

Figure 2 CPU Idle States State by Type, CPU

State Diagram by Type, CPU

In this graph, the Target and Actual states of each CPU are presented in timeline format. Each state has a separate row in the timeline. Figure 3 CPU Idle States State Diagram by Type, CPU shows the same data as Figure 2 CPU Idle States State by Type, CPU, in a timeline view.

Figure 3 CPU Idle States State Diagram by Type, CPU

CPU Frequency Graph

If CPU frequency data was collected on a system that supports multiple P- or T-states, the CPU Frequency graph will be available in the WPA UI. Each row in the following table represents time at a particular frequency level for a processor. The Frequency (MHz) column contains a limited number of frequencies that correspond to the P-states and T-states that are supported by the processor. The following columns are available for each row in the graph:

% Duration: Duration is expressed as a percentage of total CPU time over the currently visible time period.

Count: The number of frequency changes (always 1 for individual rows).

CPU: The CPU that is affected by the frequency change.

Entry Time (s): The time that the CPU entered the P-state.

Exit Time (s): The time that the CPU exited the P-state.

Frequency (MHz): The frequency of the CPU during the time that it is in the P-state.

Max:Duration (ms): The time that is spent in the P-state (default aggregation: maximum).

Min:Duration (ms): The time that is spent in the P-state (default aggregation: minimum).

Sum:Duration (ms): The time that is spent in the P-state (default aggregation: sum).

Type: Additional information on the P-state.

The default profile defines the Frequency by CPU preset for this graph. Figure 4 CPU Frequency by CPU shows a CPU as it transitions between three P-states:

Figure 4 CPU Frequency by CPU

CPU Usage (Sampled) Graph

The data that is displayed in the CPU Usage (Sampled) graph represents samples of CPU activity that are taken at a regular sampling interval. In most traces, this is one millisecond (1ms). Each row in the table represents a single sample.

The weight of the sample represents the significance of that sample, relative to other samples. The weight is equal to the timestamp of the current sample minus the timestamp of the previous sample. The weight is not always exactly equal to the sampling interval because of fluctuations in system state and activity.

Figure 5 CPU Sampling represents how data is collected:

Figure 5 CPU Sampling

Any CPU activity that occurs between samples is not recorded by this sampling method. Therefore, activities of very short duration such as DPCs and ISRs are not well represented in the CPU Sampling graph.

The following columns are available for each row in the graph:

% Weight: Weight is expressed as a percentage of total CPU time that is spent over the currently visible time range.

Address: The memory address of the function that is at the top of the stack.

Count: The number of samples represented by a row. This number includes samples that are taken when a processor is idle. For individual rows, this column is always 1.

Count:Sampled CPU Usage: The number of samples represented by a row, excluding samples that are taken when a processor is idle. For individual rows, this column is always 1 (or 0, for cases when the CPU was in a low power state).

CPU: The 0-based index of the CPU on which this sample was taken.

Display Name: The display name of the active process.

DPC/ISR: Whether the sample measured regular CPU usage, a DPC/ISR, or a low power state.

Function: The function at the top of the stack.

Module: The module that contains the function at the top of the stack.

Priority: The priority of the running thread.

Process: The image name of the process that owns the running code.

Process Name: The full name (including Process ID) of the process that owns the running code.

Stack: The stack of the running thread.

Thread ID: The ID of the running thread.

Thread Start Function: The function with which the running thread started.

Thread Start Module: The module that contains the Thread Start Function.

TimeStamp (s): The time that the sample was taken.

Weight (ms): The time (in milliseconds) that is represented by the sample (that is, the time since the last sample).

The default profile provides the following presets for this graph:

Utilization by CPU

Utilization by Priority

Utilization by Process

Utilization by Process and Thread

Utilization by CPU

The CPU Usage Utilization by CPU graph shows how work is distributed between processors. Figure 6 CPU Usage Utilization by CPU shows this distribution for two CPUs:

Figure 6 CPU Usage Utilization by CPU

Utilization by Priority

CPU Usage grouped by thread priority shows how high-priority threads impact lower-priority threads. Figure 7 CPU Usage (Sampled) Utilization by Priority displays this graph:

Figure 7 CPU Usage (Sampled) Utilization by Priority

Utilization by Process

CPU Usage that is grouped by process shows the relative usage of processes. Figure 8 CPU Usage (Sampled) Utilization by Process shows this preset. In this sample graph, one process is shown to be consuming more CPU time than the other processes.

Figure 8 CPU Usage (Sampled) Utilization by Process

Utilization by Process and Thread

CPU Usage that is grouped by process and then grouped by thread shows the relative usage of processes and the threads in each process. Figure 9 CPU Usage (Sampled) Utilization by Process and Thread shows this preset. The threads of a single process are selected in this graph.

Figure 9 CPU Usage (Sampled) Utilization by Process and Thread

CPU Usage (Precise) Graph

The CPU Usage (Precise) graph records information that is associated with context switching events. Each row represents a set of data that is associated with a single context switch; that is, when a thread started running. Data is collected for the following event sequence:

1. The new thread is switched out.

2. The new thread is made ready to run by the readying thread.

3. The new thread is switched in, thereby switching out an old thread.

4. The new thread is switched out again.

In Figure 10 CPU Usage Precise Diagram, time flows from left to right. The diagram labels correspond to column names in the CPU Usage (Precise) graph. Labels for Timestamp columns display at the top of the diagram, and labels for Interval Duration columns display at the bottom of the diagram.

Figure 10 CPU Usage Precise Diagram

Breaks in the timeline in Figure 10 CPU Usage Precise Diagram divide the timeline into regions that can occur simultaneously on different CPUs. These timelines can overlap as long as the order of the numbered events is not modified. (For example, the Readying Thread can run on Processor-2 at the same time that a new thread is switched out and then back in on Processor-1.)

Information is recorded for the following four targets on the timeline:

New thread, which is the thread that was switched in. It is the primary focus of this row in the graph.

NewPrev thread, which refers to the previous time that the new thread was switched in.

Readying thread, which is the thread that prepared the new thread to be processed.

Old thread, which is the thread that was switched out when the new thread was switched in.

The data in the following table relates to each target thread:

% CPU Usage: The CPU usage of the new thread after it is switched in. This value is expressed as a percentage of total CPU time over the currently visible time period.

Count: The number of context switches that are represented by the row. This is always 1 for individual rows.

Count:Waits: The number of waits that are represented by the row. This is always 1 for individual rows, except when a thread is switched to an idle state; in this case, it is set to 0.

CPU: The CPU on which the context switch occurred.

CPU Usage (ms): The CPU usage of the new thread after the context switch. This is equal to NewInSwitchTime, but is displayed in milliseconds.

IdealCpu: The ideal CPU selected by the scheduler for the new thread.

LastSwitchOutTime (s): The previous time that the new thread was switched out.

NewInPri: The priority of the new thread that is switched in.

NewInSwitchTime (s): NextSwitchOutTime (s) minus SwitchInTime (s).

NewOutPri: The priority of the new thread when it switches out.

NewPrevOutPri: The priority of the new thread when it was previously switched out.

NewPrevState: The state of the new thread after it was previously switched out.

NewPrevWaitMode: The wait mode of the new thread when it was previously switched out.

NewPrevWaitReason: The reason that the new thread was previously switched out.

NewPriDecr: The priority boost that affects the thread.

NewProcess: The process of the new thread.

NewProcess Name: The name of the process of the new thread, including PID.

NewState: The state of the new thread after it is switched in.

NewThreadId: The thread ID of the new thread.

NewThreadStack: The stack of the new thread when it is switched in.

NewThreadStartFunction: The start function of the new thread.

NewThreadStartModule: The start module of the new thread.

NewWaitMode: The wait mode of the new thread.

NewWaitReason: The reason that the new thread was switched out.

NextSwitchOutTime (s): The time when the new thread was next switched out.

OldInSwitchTime (s): The time that the old thread was switched in before it was switched out.

OldOutPri: The priority of the old thread when it was switched out.

OldProcess: The process that owns the old thread.

OldProcess Name: The name of the process that owns the old thread, including PID.

OldState: The state of the old thread after it is switched out.

OldThreadId: The thread ID of the old thread.

OldThreadStartFunction: The start function of the old thread.

OldThreadStartModule: The start module of the old thread.

OldWaitMode: The wait mode of the old thread.

OldWaitReason: The reason that the old thread was switched out.

PrevCState: The previous CState of the processor. If this is not 0 (Active), the processor was in an idle state before the new thread was context-switched in.

Ready (us): SwitchInTime (s) minus ReadyTime (s).

ReadyingThreadId: The thread ID of the readying thread.

ReadyingThreadStartFunction: The start function of the readying thread.

ReadyingThreadStartModule: The start module of the readying thread.

ReadyingProcess: The process that owns the readying thread.

ReadyingProcess Name: The name of the process that owns the readying thread, including PID.

ReadyThreadStack: The stack of the readying thread.

ReadyTime (s): The time when the new thread was readied.

SwitchInTime (s): The time when the new thread was switched in.

TimeSinceLast (us): SwitchInTime (s) minus LastSwitchOutTime (s).

Waits (us): ReadyTime (s) minus LastSwitchOutTime (s).

The default profile uses the following presets for this graph:

Timeline by CPU

Timeline by Process, Thread

Usage by Priority at Context Switch Begin

Utilization by CPU

Utilization by Process, Thread

Timeline by CPU

CPU Usage on a per-CPU timeline shows how work is distributed among processors. Figure 11 CPU Usage (Precise) Timeline by CPU displays the timeline on an eight-processor system:

Figure 11 CPU Usage (Precise) Timeline by CPU

Timeline by Process, Thread

CPU Usage on a per-process, per-thread timeline, shows which processes had threads running at certain times. Figure 12 Usage (Precise) Timeline by Process, Thread shows this timeline across several processes:

Figure 12 Usage (Precise) Timeline by Process, Thread

Usage by Priority at Context Switch Begin

This graph identifies bursts of high-priority thread activity at each priority level. Figure 13 CPU Usage (Precise) Usage by Priority at Context Switch Begin shows the distribution of priorities:

Figure 13 CPU Usage (Precise) Usage by Priority at Context Switch Begin

Utilization by CPU

In this graph, CPU usage is grouped and graphed by CPU to show how work is distributed among processors. Figure 14 CPU Usage (Precise) Utilization by CPU shows this graph for a system that has eight processors.

Figure 14 CPU Usage (Precise) Utilization by CPU

Utilization by Process, Thread

In this graph, CPU usage is grouped first by process and then by thread. It shows the relative usage of processes and the threads in each process. Figure 15 CPU Usage (Precise) Utilization by Process, Thread shows this distribution across multiple processes:

Figure 15 CPU Usage (Precise) Utilization by Process, Thread

DPC/ISR Graph

The DPC/ISR graph is the primary source for DPC/ISR information in WPA. Each row in the graph represents a fragment, which is a period of time during which a DPC or ISR ran uninterrupted. Data is collected at the start and end of fragments. Additional data is collected when a DPC/ISR is complete. Figure 16 DPC/ISR Diagram shows how this works:

Figure 16 DPC/ISR Diagram

Figure 16 DPC/ISR Diagram describes data that was collected during the following activities:

DPC/ISR-A starts to run.

A device interrupt that has a higher interrupt level than DPC/ISR-A causes ISR-B to interrupt DPC/ISR A, thereby ending the first fragment of DPC/ISR-A.

ISR-B completes and thereby ends the fragment of ISR-B. DPC/ISR-A resumes execution in a second fragment.

DPC/ISR-A completes, thereby ending the second fragment of DPC/ISR-A.

A row for each fragment is displayed in the data table. The fragments of a given DPC/ISR share identical information in the DPC/ISR-level columns.

The columns for the DPC/ISR graph describe either fragment-level or DPC/ISR-level information. Each fragment has dissimilar data in the fragment-level columns, and identical data in the DPC/ISR-level columns.

% Duration (Fragmented): Duration (fragmented), expressed as a percentage of total CPU time over the currently visible time period.

% Exclusive Duration: Exclusive duration, expressed as a percentage of total CPU time over the currently visible time period.

% Inclusive Duration: Inclusive duration, expressed as a percentage of total CPU time over the currently visible time period.

Address: The memory address of the DPC or ISR function.

Count (DPCs/ISRs): The count of DPCs/ISRs that are represented by this row. This is always 1 for rows that represent the final fragment of a DPC/ISR; otherwise, this count is 0.

Count (Fragments): The number of fragments that are represented by the row. This is always 1 for individual rows.

CPU: The index of the logical processor on which the DPC or ISR ran.

DPC Type: For a DPC, the type of DPC, either Regular or Timer. This value is blank for an ISR.

DPC/ISR Enter Time (s): The time in the trace when the DPC/ISR started.

DPC/ISR Exit Time (s): The time from the beginning of the trace to when the DPC/ISR completed.

Duration (Fragmented) (ms): Fragment Exit Time (s) minus Fragment Enter Time (s), in milliseconds.

Exclusive Duration (ms): The sum of fragmented durations, in milliseconds, for all fragments of this DPC/ISR.

Fragmented: If the DPC/ISR of this row had multiple fragments, this value is True; otherwise, it is False.

Fragment: If this was not the only fragment for this DPC/ISR, this value is True; otherwise, it is False.

Fragment Enter Time (s): The time that the fragment started to run.

Fragment Exit Time (s): The time that the fragment stopped running.

Function: The DPC or ISR function that ran.

Inclusive Duration (ms): DPC/ISR Exit Time (s) minus DPC/ISR Enter Time (s), in milliseconds.

MessageIndex: The interrupt index for message-signaled interrupts.

Module: The module that contains the DPC or ISR function.

Return Value: The return value of the DPC/ISR.

Type: The type of event: either DPC or Interrupt (ISR).

Vector: The value of the interrupt vector on the device.

The default profile uses the following presets for this graph:

[DPC,ISR,DPC/ISR] Duration by CPU

[DPC,ISR,DPC/ISR] Duration by Module, Function

[DPC,ISR,DPC/ISR] Timeline by Module, Function

[DPC,ISR,DPC/ISR] Duration by CPU

DPC/ISR events are aggregated by the CPU on which they ran and are sorted by duration. This graph shows the allocation of DPC activity across CPUs. Figure 17 DPC/ISR Duration by CPU shows this graph for a system that has eight processors.

Figure 17 DPC/ISR Duration by CPU

[DPC,ISR,DPC/ISR] Duration by Module, Function

DPC/ISR events are aggregated in this graph by the module and function of the DPC/ISR routines, and are sorted by duration. This shows which DPC/ISR routines consumed the most time. Figure 18 DPC/ISR Duration by Module, Function shows a period of time that incurs DPC/ISR activity in two modules:

Figure 18 DPC/ISR Duration by Module, Function

[DPC,ISR,DPC/ISR] Timeline by Module, Function

DPC/ISR events are aggregated in this graph by the module and function of the DPC/ISR routines. They are graphed as a timeline. This graph provides a detailed view of the time period over which DPCs/ISRs ran. This graph can also show how single DPC/ISRs can be fragmented. Figure 19 DPC/ISR Timeline by Module, Function shows a timeline of activity in three modules:

Figure 19 DPC/ISR Timeline by Module, Function

Stack Trees

Stack trees are displayed in the CPU Usage (Sampled), CPU Usage (Precise) and DPC/ISR tables in WPA, and in issues that are reported in assessment reports. Stack trees portray the call stacks that are associated with multiple events over a period of time. Each node in the tree represents a stack segment that is shared by a subset of the events. The tree is constructed from the individual stacks and is shown in Figure 20 Stacks from Three Events:

Figure 20 Stacks from Three Events

Figure 21 Common Segments Identified shows how common sequences are identified for this graph:

Figure 21 Common Segments Identified

Figure 22 Tree Built from Stacks shows how the common segments are combined to form the nodes of a tree:

Figure 22 Tree Built from Stacks

The Stacks column in the WPA UI contains an expander for each non-leaf node. In assessment-reported issues, the tree is displayed together with the aggregate weights. Some branches can be removed from the graph if their weights do not meet a specified threshold. The sample stack below shows how the events represented above are displayed as part of an assessment-reported issue.

A node in a stack represents the time during which a function itself is at the top of the stack; it does not include the time that is spent in functions called by that function. That duration is called the exclusive time spent in the function.

For example, Function1 calls Function2. Function2 spent 2ms in a CPU-intensive loop and called another function that ran for 4ms. This can be represented by the following stack:
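
The original sample stack is not reproduced in this copy; an illustrative rendering (FunctionX is a hypothetical name for the unnamed callee, with the weights from the example) might look like this:

    Function1                 6 ms (inclusive)
      Function2               6 ms (inclusive)
        <Function2 itself>    2 ms (exclusive: the CPU-intensive loop)
        FunctionX             4 ms (the called function)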

Techniques

This section describes a standard approach to performance analysis. It provides techniques that you can use to investigate common CPU-related performance problems.

Performance analysis is a four-step process:

Define the scenario and the problem.

Identify the components that are involved and the relevant time range.

Create a model of what should have happened.

Use the model to identify problems and investigate root causes.

Define the Scenario and the Problem

The first step in performance analysis is to clearly define the scenario and the problem. Many performance problems affect scenarios that are measured by assessment metrics. For example:

Scenario 1: A physical resource is not being fully utilized. For example, a server cannot fully utilize a network connection because it cannot encrypt packets quickly enough.

Scenario 2: A physical resource is being utilized more than it should be. For example, a system uses significant CPU resources during an idle period that uses battery power.

Scenario 3: Activities are not being completed at a required rate. For example, frames are dropped during video playback because the frames are not being decoded quickly enough.

Scenario 4: An activity was delayed. For example, the user launched Internet Explorer but it took longer than expected to open a tab.

Scenarios 3 and 4, as they relate to CPU resources, are covered in this guide. Scenarios 1 and 2 are out of scope and are not covered. To analyze these problems, you can start with an ambiguous observation such as "it is too slow" and ask additional questions to identify the scenario and the exact problem.

Identify the Components and the Time Period

After the scenario and the problem are identified, you can identify the components that are involved and the time period of interest. The components include hardware resources, processes, and threads.

You can often find the time range of interest by identifying the associated activity in the analysis guide. An activity is an interval between a start event and a stop event that you can select and zoom into, in WPA. If an activity is not defined, you can find the time range by looking for specific generic events that are associated with the scenario, or by looking for changes in resource utilization that might mark the beginning and end of a scenario. For example, if the CPU was idle for two seconds and then fully utilized for four seconds and then idle again for two seconds, the four seconds of full utilization might be the area of interest in a trace that captures video playback.

Create a Model

To understand the root causes of a problem, you must have a model of what should have happened. The model starts with the problem or any associated goal for the metric; for example, “This operation should have completed in less than 5 seconds.”

A more complete model contains information about how the components should perform. For example, what communication is expected between components? What resource utilization is typical? How long do operations usually take?

Information for the model can often be found in the assessment analysis guide. If that resource is not available, you can produce a trace from similar hardware and software that does not exhibit the performance problem, to create a model.

Use the Model to Identify Problems, and then Investigate Root Causes

After you have a model, you can compare a trace to the model to identify problems. For example, a model for a particular activity called Suspend Devices might suggest that the entire activity should complete in three seconds, while each instance of a sub-activity called Suspend should take no more than 100ms. If two instances of the sub-activity Suspend each take 800ms, you should investigate those instances.

Each deviation from the model can be analyzed to find a root cause. You should examine the state of the involved threads and look for common root causes. A few major CPU-related root causes, for activities that do not complete at a required rate or are delayed, are described here:

Direct CPU usage: The appropriate threads received full CPU resources but the required program did not execute quickly enough. This can be caused by a program malfunction or by slow hardware.

Thread Interference: A thread did not get enough running time because other threads were running instead. In this case, the thread is considered to be starved or preempted.

DPC/ISR Interference: Threads did not get enough running time because CPUs were busy processing DPCs or ISRs.

In many cases, one of these root causes does not noticeably affect the thread, and the thread spends most of its time in a waiting state. In this case, you must identify and investigate the event for which the thread is waiting. This recursive type of investigation is called wait analysis, and it starts by identifying the critical path.

Advanced Technique: Wait Analysis and the Critical Path

An activity is a network of operations, some sequential and some parallel, that flow from a start event to an end event. Any start/end event pair in a trace can be viewed as an activity. The longest path through this network of operations is known as the critical path. Reducing the duration of any operation on the critical path directly reduces the duration of the overall activity, although it can also change the critical path.

Figure 23 Activity Operations shows the activity of three threads. Thread-1 sends the activity start event and then waits for Thread-2 and Thread-3 to complete their tasks. Thread-2 completes its task first, followed by Thread-3. When both threads have completed their tasks, Thread-1 is readied and completes the activity event.

Figure 23 Activity Operations

In this scenario, the critical path includes parts of Thread-3 and Thread-1. These are traced in Figure 24 Critical Path. Because Thread-2 is not on the critical path, the time that it takes to complete its task does not affect the overall activity time.

Figure 24 Critical Path

The critical path is a low-level literal answer to the question of why an activity took as much time as it did. After key segments of critical path are known, they can be analyzed to find the problems that contribute to the overall delay.

General Approach to Finding the Critical Path

The first step to finding the critical path is to review the scenario model to understand the purpose and implementation of the activity.

Understanding an activity can help identify specific operations, processes, and threads that might be on the critical path. For example, a delay in the Fast Startup Resume Explorer Init activity can be caused by RunOnce applications and the Explorer initialization process, both of which require a significant amount of I/O.

After you review the scenario model, check to see whether the assessment reported any issues for the affected activity. Many times, an approximation of the critical path is included in assessment-reported delay issues. The critical path is shown as a sequence of waits and ready actions. It can be read from start to finish as a sequence of events, with the primary delayed segment of the critical path in the middle of the list. The last entry in the list is the action that readied the thread that completed the activity.

If you must manually look for the critical path, we recommend that you identify the process and the thread that completed the activity and work backwards from the instant that the activity completed. You can identify the process and thread that started an activity, and the process and thread that completed an activity, through the Activities graph in WPA.

The Activities graph displays when the trace is loaded through an assessment results XML file. To identify the process and the thread that are associated with a particular activity, expand the graph to the Activity of interest and then switch the view to Graph+Table. Set the graph mode to Table. The Start Process, Start Thread Id, End Process, and End Thread Id columns display for each activity in the table.

After you know the start and end process, the thread, and the implementation of the activity, the critical path can be traced backwards. Start by analyzing the thread that completed the activity, to determine how that thread spent most of its time: running, ready, or waiting.

Significant running time indicates that direct CPU usage might have contributed to the duration of the critical path. Time spent in ready mode indicates that other threads contribute to the duration of the critical path by preventing a thread on the critical path from executing. Time spent waiting points to I/O, timers, or other threads and processes on the critical path for which the current thread was waiting.

Each thread that readied the current thread is probably another link in the critical path, and can also be analyzed until the duration of the critical path is accounted for.

Procedure: Finding the Critical Path in WPA

The following procedure assumes that you have identified an activity in the Activities graph for which you want to find the critical path.

1. Identify the process that completed the activity by hovering over the activity in the Activities graph.

2. Add the CPU Usage (Precise) graph. Zoom to the affected activity, and apply the Utilization by Process, Thread preset.

3. Right-click the column headers and make the ReadyThreadStack and CPU Usage (ms) columns visible. Remove the Ready (us) [Max] and Waits (us) [Max] columns.

4. Expand the target process and sort it respectively by CPU Usage (ms), Ready (us) [Sum], and Waits (us) [Sum].

5. Search for the NewThreadIds in the process that has the highest amount of time spent in the Running, Ready, or Waiting state.

Threads that spend significant time in the Running or Ready states might represent direct CPU usage on the critical path. Note their thread IDs. Threads that spend significant time in the Waiting state might be waiting on I/O, a timer, or another thread in the critical path.

6. To discover what the threads were waiting for, expand the NewThreadId group to display the ReadyThreadStack.

7. Expand [Root].

8. Stacks that begin with KiDispatchInterrupt are not related to another thread. To determine what the thread was waiting for in these stacks, expand KiDispatchInterrupt and view the functions on the child stack. IopfCompleteRequest indicates that the readied thread was waiting for I/O. KiTimerExpiration indicates that the readied thread was waiting for a timer.

9. Expand stacks that do not begin with KiDispatchInterrupt until you see a ReadyingProcess and a ReadyingThread. If the process is already expanded, expand the NewThreadId that corresponds to the ReadyingThread. Repeat this step until you find a thread that is running, ready, waiting for another reason, or waiting on a different process. If the thread is waiting on a different process, repeat this procedure by using that process.

Example

This example presents a delay in the Fast Startup Resume Explorer Init activity. A search in the Issues pane shows that seven delay-type issues are reported for this activity. Each of these issues can be reviewed as a segment of the critical path. The following key segments are identified:

Thread 3872 of process TestBootStrapper.exe (3024) is preempted for 2.1 seconds.

Thread 3872 of process TestBootStrapper.exe (3024) uses 1 second of CPU time.

Thread 3872 of process TestBootStrapper.exe (3024) flushes a registry hive for 544 milliseconds.

Thread 3872 of process TestBootStrapper.exe (3024) sleeps for 513 milliseconds.

Threads 4052 and 4036 of Explorer.exe read from disk, causing a 461 millisecond delay.

Thread 3872 of process TestBootStrapper.exe (3024) starves for 187 milliseconds.

Thread 3872 of process TestBootStrapper.exe writes 3.5MB to disk, causing a 178 millisecond delay.

The issues show that this activity was delayed by 5.2 seconds. These delays contribute a large proportion of the activity's overall 6.3-second duration. The TestBootStrapper.exe application is primarily responsible for the delay, mostly because it preempted other processing tasks.

Investigate Issues in the Critical Path

Zoom to the affected region and add the ReadyThreadStack and CPU Usage (ms) columns.

In this case, Explorer.exe is the process that completes the activity. Expand the explorer.exe process and sort it respectively by CPU Usage (ms), Ready (us) [Sum], and Waits (us) [Sum], as shown in the following figures:

Figure 25 Activity by CPU Usage (ms)

Figure 26 Activity by Ready (us)

Figure 27 Activity by Waits (us)

Sorting by the CPU Usage (ms) column shows a top child row of 299 milliseconds. Sorting by the Ready (us) [Sum] column shows a top child row of 46 milliseconds. Sorting by the Waits (us) [Sum] column shows a top child row of 5749 milliseconds and a second row of 4902 milliseconds. Because these rows contribute significantly to the delay, you should investigate them further.

Expand the stacks to reveal the readying threads, as shown in the following figures:

Figure 28 Readying Process and Readying Thread for a Thread

Figure 29 Readying Process and Readying Thread for another Thread

In this example, the first thread spends most of its time waiting for the RunOnce.exe process to exit. You should investigate why the RunOnce.exe process is taking so much time to complete. The second thread is waiting on the first thread, and is probably an insignificant link in the same wait chain.

Repeat the steps in this procedure for RunOnce.exe. The primary contributing column is Waits (us), and it has four possible contributors.

Expand each contributor to see that the first three contributors are each waiting on the fourth contributor. This situation makes the first three contributors insignificant to the wait chain. The fourth contributor is waiting on another process, TestBootStrapper.exe.

This scenario is shown in Figure 30 Readying Process and Readying Thread for a Thread in RunOnce.exe:

Figure 30 Readying Process and Readying Thread for a Thread in RunOnce.exe

Repeat the steps in this procedure for TestBootStrapper.exe. The results are shown in the following three figures:

Figure 31 Threads by CPU Usage (ms)

Figure 32 Threads by Ready (us)

Figure 33 Threads by Waits (us)

Thread 3872 spent approximately 1 second running, 2 seconds ready, and 1.3 seconds waiting. Because this thread is the readying thread for the waiting thread in RunOnce.exe, its running and ready times probably contribute to the delay. The assessment reports the following issues whose times match the delays:

Thread 3872 of process TestBootStrapper.exe (3024) is preempted for 2.1 seconds.

Thread 3872 of process TestBootStrapper.exe (3024) starves for 187 milliseconds.

Thread 3872 of process TestBootStrapper.exe (3024) uses 1 second of CPU time.

To find other contributing issues, view the event for which thread 3872 was waiting. Expand ReadyThreadStack to view contributors to the 1.3 seconds of waiting, as shown in Figure 34 Contributors to Wait Time:

Figure 34 Contributors to Wait Time

KiRetireDpcList is typically I/O-related and KiTimerExpiration is a timer. You can view how the I/Os and timer were initiated by removing the ReadyThreadStack and then viewing the NewThreadStack. This view shows three related functions, as shown in Figure 35 I/Os and Timer on NewThreadStack:

Figure 35 I/Os and Timer on NewThreadStack

This view discloses the following details:

Thread 3872 of process TestBootStrapper.exe (3024) flushes a registry hive for 544 milliseconds.

Thread 3872 of process TestBootStrapper.exe (3024) sleeps for 513 milliseconds.

Thread 3872 of process TestBootStrapper.exe writes 3.5MB to disk, thereby causing a 178 millisecond delay.

When you started to investigate the critical path, you analyzed the most significant wait cause in Explorer.exe and disregarded any parts of the critical path that occurred after that wait cause. To capture this previously disregarded section of the critical path, you must look at the timeline. Add CPU Usage (Precise) and apply the Timeline by Process, Thread preset.

Filter to include only the processes identified as part of the critical path. The resulting graph is shown in Figure 36 Critical Path Timeline:

Figure 36 Critical Path Timeline

Figure 36 Critical Path Timeline shows that Explorer.exe performed more work after it stopped waiting for RunOnce.exe. Zoom to the time period after the previously analyzed wait chain and perform another analysis. In this case, analysis reveals a large number of threads that are internal to Explorer.exe and no clear trace through the critical path; further analysis is therefore not likely to yield actionable insights.

Direct CPU Usage

Activities are often delayed because a thread on the critical path uses significant CPU time. By using the thread state model, you can see that this problem is characterized by a thread on the critical path that spends an exceptional amount of time in the Running state. On some hardware, this heavy CPU usage can contribute to delays.

Problem Identification

Many assessments use heuristics to identify direct CPU usage-related problems. Significant CPU usage on the critical path is reported as an issue in the following form:

CPU use by process P delays the impacted activity A for x seconds

Where P is the process that is running, A is the activity, and x is the time in seconds.

If these issues are reported for an activity that incurs delays, direct CPU usage might be the cause.
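If you want to corroborate such an issue outside WPA, you can sample a thread's CPU time directly. The following is a minimal user-mode sketch (C++/Win32), not a definitive implementation; dwThreadId and intervalMs are placeholders that you would supply:

#include <windows.h>
#include <stdio.h>

// Converts a FILETIME to a 64-bit count of 100-nanosecond units.
static ULONGLONG FileTimeTo100ns(const FILETIME &ft) {
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

// Returns the percentage of one logical processor that the thread consumed
// during the measurement window, or -1.0 on failure.
double ThreadCpuPercent(DWORD dwThreadId, DWORD intervalMs) {
    HANDLE hThread = OpenThread(THREAD_QUERY_LIMITED_INFORMATION, FALSE, dwThreadId);
    if (hThread == NULL) return -1.0;

    FILETIME create, exitTime, kern1, user1, kern2, user2;
    BOOL ok = GetThreadTimes(hThread, &create, &exitTime, &kern1, &user1);
    Sleep(intervalMs);  // wall-clock measurement window
    ok = ok && GetThreadTimes(hThread, &create, &exitTime, &kern2, &user2);
    CloseHandle(hThread);
    if (!ok) return -1.0;

    // CPU time consumed during the window, in 100-ns units (kernel + user).
    ULONGLONG cpu = (FileTimeTo100ns(kern2) - FileTimeTo100ns(kern1))
                  + (FileTimeTo100ns(user2) - FileTimeTo100ns(user1));
    ULONGLONG wall = (ULONGLONG)intervalMs * 10000;  // 1 ms = 10,000 * 100 ns
    return wall ? 100.0 * (double)cpu / (double)wall : -1.0;
}

A value near 100 means the thread kept one logical processor fully busy for the window, which corresponds to the long Running-state times described above.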

Investigate Direct CPU Usage

You can manually identify the problem by looking for individual CPUs that incur 100% CPU usage in the CPU Usage (Sampled) graph.

Zoom to an area of interest in the graph and select the Utilization by Process and Thread preset.

By default, the table displays rows at the top that have the highest aggregate CPU usage. These threads also display at the top of the CPU Usage (Sampled) graph.

Note  On a system that has multiple processors, a thread that uses 100% of a single processor will appear to be consuming 100/(number of logical processors) percent of the total CPU. For example, on a system that has eight logical processors, one fully busy thread appears as 12.5% usage. On this kind of system, only the virtual idle thread (PID 0, TID 0) can show a greater processor utilization than 100/(number of logical processors). If the processes and threads that consume the most CPU correspond to any threads in the critical path, direct CPU usage is probably a factor.

Example of an Assessment-Reported Direct CPU Usage Issue

CPU use by the TestUM.exe process (4024) delays the impacted activity, Fast startup shutdown process TestIM.exe, for 2.1 seconds. This example is shown in Figure 37 Thread 3208:

Figure 37 Thread 3208

Investigation

After you discover that direct CPU usage contributes to a delay on the critical path, you must identify the specific modules and functions that contribute to the delay.

Technique: Review an Assessment-Reported Direct CPU Usage Issue

You can expand an assessment-reported direct CPU usage issue to display the critical path that is impacted by the direct CPU usage. If you expand the node that is associated with the CPU usage, the associated stacks and modules are displayed. This view is shown in Figure 38 Expanded CPU Usage Segment:

Figure 38 Expanded CPU Usage Segment

Technique: Manually Explore the Stacks of a Direct CPU Usage Issue

If the assessment did not report an issue, or if you require additional verification, you can use the CPU Usage (Sampled) graph to manually collect information on the modules and functions that are involved in a CPU usage issue. To do this, you must zoom to the area of interest and view the stacks that are sorted by CPU Usage.

Manually Explore the Stacks of a Direct CPU Usage Issue

On the Trace menu, click Load Symbols.

Zoom the timeline to display only the portion of the critical path that is affected by the CPU issue.

Apply the Utilization by Process and Thread preset.

Add the Stack column to the display, and then drag this column to the right of Thread ID (left of the bar).

Expand the process and thread to display the stack trees.

The rows in the stack are sorted in descending order by % Weight of CPU Usage. This puts the most interesting stacks on top. As you expand the stacks, watch the % Weight column to make sure that your focus remains on the rows that have the highest usage.

To extract a copy of the stack, select all the rows, right-click, and click Copy Selection.

Resolution

You can apply remedies at both the configuration and component levels to resolve high CPU usage.

Direct CPU usage has higher impact on computers that have lower-end processors. In these cases, you can add more processing power to the computer. Or, you might be able to remove the problem modules from the critical path or from the system. If you can change the components, consider a redesign effort to achieve one of the following results:

Remove the CPU-intensive code from the critical path

Use more CPU-efficient algorithms

Defer or cache work (a sketch follows this list)
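As an illustration of the last item, the following is a minimal, hypothetical sketch of the "cache work" idea: results of a CPU-intensive function are memoized so that repeated requests on the critical path become cheap lookups. ExpensiveTransform is a stand-in for whatever hot function your stack analysis identified:

#include <map>
#include <mutex>
#include <string>

// Stand-in for the CPU-intensive work found during stack analysis.
std::string ExpensiveTransform(const std::string &input) {
    return input;  // placeholder body
}

class ResultCache {
    std::map<std::string, std::string> cache_;
    std::mutex lock_;
public:
    std::string Get(const std::string &input) {
        {
            // Fast path: return a previously computed result.
            std::lock_guard<std::mutex> guard(lock_);
            auto it = cache_.find(input);
            if (it != cache_.end()) return it->second;
        }
        // Slow path: compute outside the lock, then publish the result.
        // (Two threads may compute the same input concurrently; emplace
        // keeps the first result and the duplicate is discarded.)
        std::string result = ExpensiveTransform(input);
        std::lock_guard<std::mutex> guard(lock_);
        cache_.emplace(input, result);
        return result;
    }
};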

Thread Interference

CPU usage by threads that are not on the critical path (and that might be unrelated to the activity) can cause threads that are on the critical path to be delayed. The thread state model shows that this problem is characterized by threads on the critical path that spend an unusual amount of time in the Ready state.

Problem Identification

Many assessments use heuristics to identify interference-related problems. These are reported in one of the following two forms:

Process P is starved. The starvation causes a delay to the impacted activity A of x ms.

Process P is preempted. The preemption causes a delay to the impacted activity A of x ms.

Where P is the process, A is the activity, and x is the time in ms.

The first form reflects interference from threads at the same priority level as the thread on the critical path. The second form reflects interference from threads that are at a higher priority level than the thread on the critical path.

If these types of issues are reported for a delayed activity, thread interference can be the cause. You can use the CPU Usage (Precise) graph to manually identify the problem.

Identify Thread Interference Issues

Zoom to the interval and apply the Utilization by CPU preset. A utilization of 100% across all CPUs indicates an interference issue.

Apply the Utilization by Process, Thread preset and sort by the first Ready (us) column. (This is the column that includes the Sum aggregation.)

Expand the process of the affected activity and look at the Ready time for threads on the critical path. This value is the maximum time by which the delay can be reduced by resolving any thread interference issue. A value that is significant relative to the delay under investigation indicates that a thread interference problem exists.

Figure 39 CPU Utilization is Near 100% and Figure 40 Thread Interference Problem represent this scenario:

Figure 39 CPU Utilization is Near 100%

Figure 40 Thread Interference Problem

Investigation

After the issue is identified, you must determine why the affected thread spent so much time in the Ready state.

Technique: Determine Why a Thread Spent Time in the Ready State

You can use the CPU Usage (Precise) graph to determine why a thread spent time in the Ready state. You must first determine whether the thread is restricted to certain processors. Although you cannot directly obtain this information, you can examine the CPU usage history of a thread during periods of high CPU utilization. This is the period when threads tend to frequently switch between processors.

Determine a Thread’s Processor Restrictions

Zoom to the affected region.

Add the CPU Usage (Precise) graph and apply the Utilization by Process, Thread preset.

Use the Advanced dialog to add a Cpu column that has a Unique Count aggregation mode to the right of NewThreadId.

Filter the graph to show only the threads in which you are interested.

The value in the Cpu column reflects the number of processors on which the thread ran during the current time interval. During periods of 100% CPU utilization, this number approximates the number of processors on which this thread is allowed to run. If the value is less than the number of available processors, the thread is probably restricted to certain CPUs.

Figure 41 Restricted Threads provides an example of this graph:

Figure 41 Restricted Threads
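The trace can only approximate affinity in this way. At run time, however, you can query a thread's allowed processors directly. The following is a minimal sketch using documented Win32 calls; it assumes hThread was opened with THREAD_QUERY_INFORMATION and THREAD_SET_INFORMATION access:

#include <windows.h>
#include <stdio.h>

void PrintAffinityMasks(HANDLE hThread) {
    // Process-wide and system-wide affinity masks.
    DWORD_PTR processMask = 0, systemMask = 0;
    if (GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
        printf("process mask: 0x%llx  system mask: 0x%llx\n",
               (unsigned long long)processMask, (unsigned long long)systemMask);
    }

    // There is no GetThreadAffinityMask; SetThreadAffinityMask returns the
    // previous mask, so set it to the process mask and restore it.
    DWORD_PTR previous = SetThreadAffinityMask(hThread, processMask);
    if (previous != 0) {
        SetThreadAffinityMask(hThread, previous);  // restore immediately
        printf("thread mask:  0x%llx\n", (unsigned long long)previous);
    }
}

A thread mask of 0x2, for example, corresponds to a thread that is restricted to CPU 1, as in the example that follows.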

After you know a thread’s processor restrictions, you can determine what preempted or starved the thread. To do this, you must identify the intervals that the thread spent in the Ready state and then examine what other threads or processes were running during those intervals.

Determine what Preempted or Starved the Thread

Construct a graph that shows when the thread was in the Ready state and apply the Utilization by Process, Thread preset.

Open the View Editor, click Advanced, and select the Graph Configuration tab.

Set Start Time to ReadyTime (s) and set the Duration to Ready (us), as shown in Figure 42 Ready Time Columns. Click OK.

Figure 42 Ready Time Columns

In the View Editor, replace the CPU Usage (%) column with the Ready (us) [Sum] column.

Select the thread of interest to produce a graph that is similar to Figure 43 Ready Time Graph:

Figure 43 Ready Time Graph

In this case, the thread spent significant time in the Ready state. To determine its typical priority, add an Average aggregation to the NewInPri column.

In this case, the thread’s average priority is exactly 8. This number indicates that it is probably a background thread that never receives priority elevations.

After the average priority is known, look at the CPU activity for the CPUs on which the thread is allowed to run.

In this case, the thread was determined to have affinity for CPU 1 only.

Add another CPU Usage (Precise) graph and apply the Utilization by CPU preset. Select the relevant CPUs.

Open the Advanced view and add a filter for the priority that you found earlier to filter out that thread. This scenario is shown in Figure 44 Thread Filter:

Figure 44 Thread Filter

In Figure 45 CPU Usage, Ready Time, and Other Thread Activity, the top graph shows the CPU usage of thread 3548. The middle graph shows the time that the thread was ready, and the bottom graph shows activity on the CPUs on which the thread was allowed to run (in this case, Cpu1).

Figure 45 CPU Usage, Ready Time, and Other Thread Activity

Zoom into a region where the thread was ready, but did not run, for most of the time during that interval.

In the CPU Usage graph, add NewInPri to the left of the bar and examine the results.

Threads or processes that have priorities equal to the target thread's priority show time that the thread was starved. Threads or processes that have a higher priority than the target thread show time that the thread was preempted. You can calculate the total time that the thread was preempted by adding the times of all preempting threads and actions.

Figure 46 Usage by Priority When Target Thread was Ready shows that 730ms of the thread time were preempted, and 300ms of the thread time were starved. (This figure is zoomed to a 1192ms interval.)

Figure 46 Usage by Priority When Target Thread was Ready

To determine which threads are responsible for the preemption and starvation of this thread, add the NewProcess column to the right of the NewInPri column and review the priority levels at which processes were running. In this case, the preemption and starvation were primarily caused by another thread in the same process and by TestResidentApp.exe. You can assume that these processes receive periodic priority elevations above their base priority.

Resolution

You can resolve preemption or starvation issues by changing the configuration or components. Consider the following remedies:

Remove the problematic processes from the system.

Adjust the base priority of the problematic processes (see the sketch after this list).

Change the time when the problematic processes run; for example, delay their start time to occur when the computer reboots.

If the problem components can be changed, redesign them to be less CPU-intensive or to run at a lower priority.
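As a sketch of the priority remedy, the following hypothetical helper lowers a problematic process's priority class so that it yields to normal-priority threads on the critical path; backgroundPid is a placeholder for the process that you identified:

#include <windows.h>

// Lowers the priority class of the given process. Requires
// PROCESS_SET_INFORMATION access to the target process.
BOOL LowerProcessPriority(DWORD backgroundPid) {
    HANDLE hProcess = OpenProcess(PROCESS_SET_INFORMATION, FALSE, backgroundPid);
    if (hProcess == NULL) return FALSE;

    // BELOW_NORMAL_PRIORITY_CLASS keeps the process runnable but prevents it
    // from preempting normal-priority threads.
    BOOL ok = SetPriorityClass(hProcess, BELOW_NORMAL_PRIORITY_CLASS);
    CloseHandle(hProcess);
    return ok;
}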

DPC/ISR Interference

When excessive processor time is consumed by running DPCs and ISRs, there might not be enough available CPU time left to run threads. This situation can cause delays similar to thread interference. When threads must complete operations at a regular high-frequency rate, such as in video playback or animation, interference by DPCs and ISRs can cause operational problems.

Problem Identification

Many assessments use heuristics to identify DPC/ISR-related problems. Suspicious DPC/ISR activity is reported as an issue in the following form:

DPC D exceeds the threshold of m milliseconds x times during P. The n instances of this DPC run for a combined total of t milliseconds.

Where D is the DPC, m is the number of milliseconds that sets the threshold, x is the number of times that the DPC exceeded the threshold, P is the current process, n is the number of instances that the DPC ran, and t is the total time in milliseconds that the DPC ran over the threshold.

For example, the following issue is reported by an assessment:

DPC sdbus.sys!SdbusWorkerDpc exceeds the goal of 3.0 milliseconds 153 times during Media Engine Lifetime. The 153 instances of this DPC run for a combined total of 864 milliseconds. (On average, each instance ran for approximately 5.6 milliseconds, well above the 3.0-millisecond goal.)

If this issue is reported for an activity that exhibits problem events or delays, DPC/ISR activity might be the cause.

Manually Identify DPC/ISR Interference

To manually identify DPC/ISR interference, open a trace in WPA and identify the problem events of interest. These are assessment-specific generic events such as Microsoft-Windows-Dwm-Core:SCHEDULE_GLITCH or Microsoft-Windows-MediaEngine:DroppedFrame.

Next to the graph of events, add the DPC/ISR Duration by CPU graph. If peaks in the DPC/ISR Duration by CPU graph line up with the problem events, DPC/ISRs might be a factor in causing the problems.

For additional data, zoom to the 100ms time period that precedes several of the problem events. If significant DPC/ISR activity displays on one or more processors in the 100ms region before the problem events occurred, you can conclude that the problem events were caused by the DPC/ISR activity.

To determine whether DPC/ISR interference is causing delays, zoom to a region that shows a running thread. Make a note of the CPU or CPUs on which this thread is running.

In the DPC/ISR graph, apply the DPC/ISR Duration by CPU preset and view the DPC/ISR activity on the relevant CPUs in that time range.

Figure 47 Problem Events and DPC/ISR Activity shows that thread 864 of iexplore.exe is relevant to the affected activity. Thread 864 is in the Running state on CPU2 for 10.65% of the time range in view. However, the DPC/ISR graph shows that CPU2 was busy executing DPC/ISRs for 10% of that time.

Note  Most DPC/ISRs do not have as high an impact as that shown in this example.

Figure 47 Problem Events and DPC/ISR Activity

In Figure 48 DPC/ISR Unrelated to Problem Events, DPC/ISRs are shown to not be related to performance problems:

Figure 48 DPC/ISR Unrelated to Problem Events

In Figure 49 Delay caused by DPC/ISR Interference, DPC/ISRs are shown to cause performance problems:

Figure 49 Delay caused by DPC/ISR Interference

Investigation

After you determine that DPCs/ISRs are related to problems or delays, you must determine which specific DPCs/ISRs are involved and why they occur frequently or run for an excessive length of time.

Technique: Review an Assessment-Reported DPC/ISR Issue

In assessment-reported DPC/ISR issues, you can expand the issue to display the major processes that are preempted by the DPC or ISR. Expand the stack to view the DPC activity for the process that is most related to the impacted activity and to understand what the DPC was doing. Figure 50 Expanded DPC Stack shows the expanded stack:

Figure 50 Expanded DPC Stack

Technique: Find the Highest Duration DPCs/ISRs and Review the Stacks

If an assessment does not report the DPC/ISR to be an issue, you can use the DPC/ISR and CPU Usage (Sampled) graphs to get stack information for the most relevant DPCs. We recommend that you find a DPC/ISR of interest, note its module and function, and then find the samples in the CPU Usage (Sampled) graph to get complete stack information.

Find the Highest Duration DPCs/ISRs and Review the Stacks

Zoom to the interval of interest.

In the DPC/ISR graph, select the preset DPC/ISR Duration by Module, Function.

If symbols are loaded, DPC/ISR events are sorted by total duration and are then broken down by Module and Function. The top rows in the list contain the DPC/ISR events that probably caused the problem events. Make a note of the module and function names.

In the CPU Usage (Sampled) graph, select the Utilization by Process preset. By default, this preset hides DPC/ISR activity.

Open the View Editor, and click Advanced.

On the Filter tab, change the Hide rows that match the filter setting to Keep rows that match the filter. This enables DPC/ISR activities to be displayed.

Remove the Process column and add the Stack column to view DPCs/ISRs sorted by stack.

Clear the current row selection.

Right-click a cell in the Stack column and then click Find in this column.

Enter a module and function that you noted in Step 2 of this procedure.

Check Add to current selection, and click Find All to select all instances of the function.

After all the rows are selected, right-click and click Butterfly/View Callees.

This view shows the activities of this particular function, sorted by total duration. The view is similar to a stacks display in the detailed view of an assessment-reported issue. The Weight column approximates the inclusive time that is spent by each function on the stack, in milliseconds.

This view is shown in Figure 51 Callees of a DPC Sorted by Approximate Duration:

Figure 51 Callees of a DPC Sorted by Approximate Duration

Technique: Review Long-Running DPCs/ISRs

The total duration of DPCs/ISRs is important, but long-running individual DPCs/ISRs are more likely to cause delays. In the DPC/ISR graph, the Inclusive Duration (ms) column, sorted in descending order, displays the maximum durations of individual DPC/ISRs. The preset Long DPCs/ISRs that is available in some assessment profiles lets you filter this view to display only DPCs/ISRs that have an inclusive duration that is greater than 1ms.

Note  If this preset is not available, you can open the Advanced section of the View Editor to add a filter.

Resolution

DPC/ISR activity often reflects a hardware or software problem that must be corrected at the hardware or component level. At a configuration level, you can replace the hardware or upgrade the related driver to a fixed version. At a component level, hardware and drivers should follow best practices for DPCs/ISRs from MSDN, and should use threaded DPCs when possible. Threaded DPCs do not run at the dispatch level on client editions of Windows. For more information about best practices for DPCs/ISRs, see Guidelines on ISR and DPC behavior and Introduction to Threaded DPCs.
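For illustration, the following kernel-mode fragment sketches the threaded-DPC pattern with the documented WDM routines KeInitializeThreadedDpc and KeInsertQueueDpc. It is a sketch only; the surrounding driver (device extension, interrupt plumbing) is assumed and omitted:

#include <wdm.h>

KDPC g_WorkDpc;  // normally stored in the device extension, not a global

// The deferred routine may run at PASSIVE_LEVEL (when threaded DPCs are
// enabled, the default on client Windows) or at DISPATCH_LEVEL, so it must
// be written to tolerate either IRQL.
VOID WorkDpcRoutine(PKDPC Dpc, PVOID DeferredContext,
                    PVOID SystemArgument1, PVOID SystemArgument2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(DeferredContext);
    UNREFERENCED_PARAMETER(SystemArgument1);
    UNREFERENCED_PARAMETER(SystemArgument2);
    // Keep the work short; defer long operations to a work item.
}

VOID InitializeAndQueueWork(VOID)
{
    KeInitializeThreadedDpc(&g_WorkDpc, WorkDpcRoutine, NULL);
    KeInsertQueueDpc(&g_WorkDpc, NULL, NULL);  // queued like an ordinary DPC
}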

Column Details