Windows 2000 Performance Guide

Help for Administrators and Application Developers

By Mark Friedman & Odysseas Pentakalos
January 2002
1-56592-466-5, Order Number: 4665
718 pages, $44.95

Chapter 5
Multiprocessing

As discussed in the last chapter, one sure route to better performance is to buy denser microprocessor chips, which have more logic packed into less space and therefore run faster. Moore's Law, formulated by Intel co-founder Gordon Moore, predicts that microprocessor density (and with it, speed) will double every 18 to 24 months or so, and it has not let us down over the last 20 years. If you wait long enough, perhaps your performance problems will just go away with the next generation of computer chips! Another proven technique is multiprocessing: building computers with two, four, or more microprocessors, all capable of executing the same workload in parallel. Instead of waiting another 18 months for processor speed to double again, you might be able to take advantage of multiprocessing technology to double or quadruple your performance today. If you have a workload that has run out of capacity on a single-processor system, a multiprocessor configuration running Windows 2000 may be the only reasonable alternative that offers any hope of relief from these capacity constraints today.

Multiprocessing technology lets you harness the power of multiple microprocessors running a single copy of the Windows 2000 operating system. Enterprise-class server models with multiple CPUs abound. When is a two- or four-way multiprocessor solution a good answer for your processing needs? How much better performance should you expect from a server with multiple engines? What sorts of workloads lend themselves to multiprocessing? These are difficult questions to answer, but we try to tackle them here. For even bigger problems, you might even want to consider more exotic solutions where multiple multiprocessors are clustered to work together in parallel to power your web site, for example. However, clustered Windows 2000 solutions are beyond the scope of this book.

In this chapter, we introduce the performance considerations relevant to multiprocessors running Windows 2000. Per our usual method, we attempt to construct a conceptual framework that will allow you to formulate answers to these performance and capacity planning questions for your workload. First, we introduce multiprocessor technology and see how it works, then discuss how it is implemented in Windows 2000 and what performance trade-offs are involved. We also introduce the measurement data available under Windows 2000 that focuses on multiprocessor performance, and look at those applications that can best take advantage of multiprocessing for better performance. Finally, we discuss configuration and tuning options that are available for Windows 2000 multiprocessors.

Although many applications take good advantage of multiprocessors under Windows 2000 Professional, we concentrate on Windows 2000 Server workloads exclusively in this chapter, as that is where multiprocessing configurations are most common. These also tend to be relatively expensive systems, so understanding the performance ramifications of multiprocessors has a bottom line component associated with it. Again, we focus specifically on Intel hardware (although most of the general discussion applies no matter what your hardware platform is). By their very nature, multiprocessors are not simple systems, and, consequently, the issues and performance trade-offs that need to be understood can be quite complex. Unfortunately, there is no way to avoid complexity here, and frankly, the performance considerations for multiprocessing and parallel processing systems are so involved that we only scratch the surface of this fascinating topic. Many people have devoted their professional careers to the design of high-performance parallel systems, and there is a great body of published research and literature available to anyone with an interest in this subject. Suggestions for further reading are included in the bibliography.

Multiprocessing Basics

Very powerful multiprocessor (MP) hardware configurations are widely available today from many PC server manufacturers. Beginning as far back as the 486 processors, Intel hardware has provided the basic capability required to support multiprocessing designs. However, it was not until the widespread availability of the P6, known commercially as the Intel Pentium Pro, that Intel began making processor chips specifically built with multiprocessing servers in mind. The internal bus and related bus mastering chip sets that Intel built for the P6 were designed to string up to four Pentium Pro microprocessors together. As we write this chapter, hardware manufacturers are bringing out high-end servers based on the Intel Pentium III, specifically the flavor of microprocessor branded Xeon that is designed for 4-, 8-, and 16-way multiprocessors. Anyone interested in the performance of these large-scale, enterprise-class Windows 2000 Servers should be aware of the basic issues in multiprocessor design, performance, and capacity planning.

Shared-Memory Multiprocessing

The specific type of multiprocessing implemented using P6, Pentium II, Pentium III, and Pentium 4 chips is generally known as shared-memory multiprocessing. (There is no overall consensus on how to classify various parallel processing schemes, but, fortunately, most authorities at least can agree on what constitutes a shared-memory multiprocessor!) In this type of configuration, the processors operate totally independently of each other, but they do share a single copy of the operating system and access to main memory (i.e., RAM). A typical dual processor shared-memory configuration is illustrated in Figure 5-1, with two P6 processors that contain dedicated Level 2 caches. (They each have separate built-in Level 1 caches, too.) A two-way configuration simply means having twice the hardware: two identical sets of processors, caches, and internal buses. Similarly, a four-way configuration means having four of everything. Having duplicate caches is designed to promote scalability, since the cache is so fundamental to the performance of pipelined processors.

Figure 5-1. A shared-memory multiprocessor

The processors also share a common bus that is used to access main memory locations. This, obviously, is not so scalable, and according to experienced hardware designers, this shared component is precisely where the bottleneck in shared-memory designs is often found.

The idea behind shared-memory multiprocessors looks pretty simple, and if it were not for the complex, internal pipelined microarchitecture of these processors, it would be. We recall that memory locations have to be loaded into cache memory before they can be accessed by the microprocessors in the usual, pipelined fashion. As a practical matter, data from memory can be resident in more than one processor cache at the same time. This leads to problems maintaining the integrity of shared data stored in memory. For instance, if Thread A running on CPU 0 changes location x'f00a 0dc4' in memory, then Thread B, executing on CPU 1 and trying to retrieve this data, must be able to access the most current copy from shared memory, rather than use a stale version from its private cache. This is generally known as the cache coherence problem, and it has several important performance implications, discussed in a moment.

Operating system support

Each processor in a multiprocessor is capable of executing work independently of the others. Separate, independent threads may be dispatched, one per processor, and run in parallel. Only one copy of the Windows 2000 operating system is running, controlling what runs on all the processors. From a performance monitoring perspective, you will see multiple instances of the processor object reported in both Taskman (as illustrated in Figure 5-2) and System Monitor.

Figure 5-2. Measurement support for multiprocessors in Task Manager

The specific type of multiprocessor support offered beginning in Windows NT 4.0 is known as symmetric multiprocessing, often abbreviated as SMP. Symmetric, in this context, means that every thread is eligible to execute on any processor. Prior to NT 4.0, Windows NT supported only asymmetric multiprocessing. On an NT 3.5x machine, interrupts could be processed only on CPU 0. CPUs 1, 2, and 3 could run only user and kernel code, never interrupt service routines (ISRs) or Deferred Procedure Calls (DPCs). This asymmetry ultimately limits the scalability of NT 3.5x multiprocessor systems, because the CPU 0 engine is readily overloaded under some workloads while the remaining microprocessors are idling. In an SMP, in theory, all the microprocessors should run out of capacity at the same time. One of the key development projects associated with the NT 4.0 release was the set of changes to the kernel to support SMPs. In addition, the development team fine-tuned the OS code to run much better in a multiprocessing environment. (Later in this chapter, we discuss some of the coding techniques that are appropriate in a multiprocessing environment.) Windows 2000 incorporates further improvements inside the operating system to boost performance on large n-way multiprocessor configurations.

The SMP support available in Windows NT 4.0 and above normally allows any processor to process any interrupt, as illustrated in Figure 5-3. This performance data depicts a two-way symmetric multiprocessor system running NT 4.0 Server. (Be careful: the vertical axis scale was adjusted down to a maximum of 30 to make the chart data easier to decipher.) The two processor instances of % Privileged Time and % Interrupt Time counters are shown. CPU 0 and 1 Privileged Time are the top two lines. You have to look carefully to see that these are two lines because they are so nearly identical on this SMP. CPU 0 and 1 Interrupt Time are shown at the bottom; again, these two lines are virtually identical. The processing workload is roughly balanced across both processors, although the load does bounce back and forth a bit, depending on what threads happen to be ready to run.

As should be evident from these pictures, threads are dispatched independently in Windows 2000, so it is possible for a multithreaded application to run in parallel on separate processors. Applications designed for Windows 2000 take advantage of this architecture by creating multiple threads, as we saw in Chapter 3. The operating system creates separate Idle threads that are dispatched on the appropriate idle processor when there is no other work to do, so that it can account for processor utilization on a per-processor basis. The mechanism for keeping track of processor usage at the thread and process levels is essentially unchanged: a thread's CPU time is accumulated in the same way no matter which processor it is dispatched on. The various Thread % Processor Time counters are then rolled up to the Process % Processor Time counter in the same fashion. By default, System Monitor will not report a value for the Process % Processor Time counter that is greater than 100%. But the default behavior of Sysmon can be modified by specifying a CapPercentsAt100 value of zero under the Registry key HKEY_CURRENT_USER\Software\Microsoft\PerfMon to allow Sysmon to report values of Process % Processor Time greater than 100%. You should make this change on all multiprocessors.
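If you prefer to script the Registry change just described, it can be made from a command prompt. This is a sketch only; reg.exe is a standard tool on later versions of Windows but may need to be installed separately (for example, from the Support Tools) on a Windows 2000 machine:

```shell
rem Allow Sysmon to report Process % Processor Time above 100% on an SMP.
rem HKCU affects only the current user; restart Sysmon to pick up the change.
reg add "HKCU\Software\Microsoft\PerfMon" /v CapPercentsAt100 /t REG_DWORD /d 0 /f
```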

The bottom of Figure 5-3 shows the amount of CPU time consumed servicing interrupts on the two processors. The amount of time spent processing interrupts per processor is roughly equal, though there is some degree of variation that occurs naturally. This is not always the most efficient way to process interrupts, by the way. Having one type of ISR or DPC directed to a single processor can have a positive impact on performance if the processor that runs the DPC code, for instance, is able to keep it in cache, rather than being forced to fetch instructions from memory. Similarly, the Windows 2000 Scheduler tries to dispatch a ready thread on the same processor where it recently ran for that very same reason. A thread is said to have an affinity for the processor where it was most recently dispatched. Processor affinity in a multiprocessor is discussed later.

Figure 5-3. Symmetric multiprocessing in Windows 2000

Thread scheduling

Since multiple programs are loaded in Windows 2000 memory anyway, it seems like it should be a relatively simple matter for Windows 2000 to run two (or three or four or eight) programs at a time. Unfortunately, shared-memory and coordination issues associated with the need to maintain cache coherence make things considerably more complicated. Even though the separate processors operate largely independently of each other, there are some occasions where the hardware must communicate and coordinate. There are also times when it is necessary to prevent multiple independent processors from running in parallel. Indeed, shared memory multiprocessors require some mechanism to ensure that threads execute serially as they access and update global shared-memory data structures. A brief look at how the Windows 2000 Scheduler works in a multiprocessor configuration will illustrate the data integrity concerns. It will also serve both to highlight the cache coherence problem and put the performance implications of multiprocessing into perspective.

The Windows 2000 OS kernel is multithreaded, which means it is capable of working in parallel if the hardware is available. During installation, the presence of a multiprocessor is detected and a multiprocessing version of ntoskrnl.exe, the OS kernel, is loaded. During OS initialization, about 20 additional kernel threads are allocated for each additional processor that is detected. Windows 2000 internal data structures provide native support for up to 32 processing engines. Meanwhile, Windows 2000 maintains a single copy of the Scheduler's dispatching queue for ISRs, DPCs, and kernel and normal program threads that is shared across all processors. There is only one Dispatcher Ready Queue, even on a multiprocessor.

Processor affinity

Logically, the structure of the Ready Queue and the relative priority of threads waiting to execute is identical whether Windows 2000 is executing on a single processor or on multiple processors. The main difference is that multiple threads can run concurrently on a multiprocessor, a little detail that leads to many complications. In the regular course of events, the Windows 2000 Scheduler selects the highest priority waiting thread to run on each available processor. In addition, certain execution threads may have an affinity to execute on specific processors. Windows 2000 supports both hard affinity, where a given thread is eligible to run only on specific processors, and soft affinity, where Windows 2000 favors scheduling specific threads on specific processors, usually for performance reasons.

Hard affinity is specified at the process and thread level using a processor affinity mask. The Win32 API calls to accomplish this are straightforward. First, a thread issues a GetProcessAffinityMask call referencing a process handle, which returns both the process's current affinity mask and a 32-bit SystemAffinityMask. Each bit in the SystemAffinityMask represents a configured processor. Then, the program calls SetProcessAffinityMask with a corresponding 32-bit affinity mask that indicates the processors on which threads from the process can be dispatched. Figure 5-4 illustrates the use of this function in Taskman, which allows you to set a process's affinity mask dynamically, subject to the usual security restrictions that allow Taskman to operate only on foreground-eligible processes by default. There is a corresponding SetThreadAffinityMask call to override the process settings for specific threads. Once hard affinity is set, threads are eligible to be dispatched only on the specified processors. This chapter concludes with a section discussing when setting a processor affinity mask for certain applications can be very desirable for performance reasons. Along the way, it is necessary to tackle several related multiprocessing topics to consider this important configuration and tuning strategy.
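To make the affinity mask layout concrete, here is a small Python sketch of the bit manipulation involved. The helper names are our own illustration, not Win32 APIs; only the bit layout (bit n set means processor n is eligible) matches the masks the real calls use:

```python
# Bit n set in a 32-bit affinity mask means the thread may run on processor n.

def cpus_allowed(affinity_mask):
    """Return the processor numbers a 32-bit affinity mask permits."""
    return [cpu for cpu in range(32) if affinity_mask & (1 << cpu)]

def mask_for(cpus):
    """Build an affinity mask permitting only the given processors."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# A mask of 0b0101 restricts a process to CPUs 0 and 2 on a four-way system.
print(cpus_allowed(0b0101))   # [0, 2]
print(hex(mask_for([0, 1])))  # 0x3
```

A program would pass a value like `mask_for([0, 1])` to SetProcessAffinityMask to pin a process to the first two engines.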

Figure 5-4. Setting a process's processor affinity mask using Taskman

Suppose that following an interrupt, an application program thread becomes ready to run, and multiple processors are idle. First, Windows 2000 must select the processor on which to run the ready thread. This decision is based on performance. If the thread was previously dispatched within the last Scheduler quantum (you remember the Scheduler time-slice quantum, don't you?), Windows 2000 attempts to schedule the thread on the processor where it last ran, which it does as long as that processor is idle or the ready thread has a higher priority than the thread currently running there. This is known as soft affinity. By scheduling the thread back on the same processor where it just ran, Windows 2000 hopes that a good deal of the thread's code and data from the previous execution interval is still present in that processor's cache. The difference in instruction execution rate between a cache "cold start," when a thread is forced to fault its way through its frequently accessed code and data, and a "warm start," when the cache is preloaded, can be substantial. However, Windows 2000 is willing to schedule the waiting thread on a different processor if the preferred processor is busy with a higher-priority task.
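The decision logic just described can be sketched as follows. This is a deliberately simplified Python model of soft affinity, our own approximation rather than the actual Scheduler code:

```python
# Simplified soft-affinity model: prefer the processor where the thread last
# ran (warm cache); otherwise use an idle processor or preempt a
# lower-priority thread.

def pick_processor(last_ran, idle, running_priority, thread_priority):
    """last_ran: CPU where the thread ran last quantum (soft affinity).
    idle: set of idle CPU numbers.
    running_priority: dict mapping each busy CPU to its thread's priority.
    thread_priority: priority of the ready thread."""
    if last_ran in idle:
        return last_ran          # best case: warm cache on an idle engine
    if last_ran in running_priority and thread_priority > running_priority[last_ran]:
        return last_ran          # preempt on the preferred processor
    if idle:
        return min(idle)         # cache cold start on another engine
    # Otherwise preempt the lowest-priority running thread, if we outrank it.
    cpu = min(running_priority, key=running_priority.get)
    return cpu if thread_priority > running_priority[cpu] else None

print(pick_processor(1, {1, 2}, {}, 8))    # 1 (the preferred CPU is idle)
print(pick_processor(0, {1}, {0: 10}, 8))  # 1 (preferred CPU busy with higher priority)
```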

Serialization and locking

The Scheduler code in Windows 2000 can run in multiple threads concurrently. Therefore, Windows 2000 must ensure that two processors do not try to run the same ready thread at the same time. This is just one example of a more general integrity issue in a multiprocessor. Because shared-memory data structures can be accessed by multiple threads concurrently, a serialization or locking mechanism is necessary to ensure that independent processors do not, for example, select the same thread off the Ready Queue at the same time. To prevent this, the shared-memory locations associated with the Windows 2000 Scheduler's dispatching queue are locked so that threads are only added or removed from the Ready Queue one at a time. We look at this serialization mechanism more closely in a moment. Serialization services are one of the key functions performed by the Windows 2000 operating system. (The current fashion in OS design is that about the only function performed in the low-level operating system microkernel is serialization. That is how fundamental serialization and locking are!) Windows 2000 must ensure that kernel threads running on multiple processors serialize before accessing its shared internal data structures. Windows 2000 also provides serialization services to application programs with similar requirements.

Shared-memory multiprocessor programming model

One reason shared-memory multiprocessors are so popular is that, with minor exceptions, programs that run on uniprocessors can run unmodified on shared-memory multiprocessors. In addition, almost any multithreaded application will likely benefit from having a choice of processors to run on--any thread can run on any available processor in an SMP. However, extreme care must be exercised to prevent multiple threads in your application from accessing and changing shared data structures in parallel. To help developers, the Windows 2000 operating system provides a complete set of platform-independent serialization services, courtesy of the HAL, for use by both application programs and kernel mode drivers. These thread synchronization services are used to ensure that Windows 2000 applications are "thread-safe," i.e., capable of executing correctly on a multiprocessor. This set of thread synchronization services allows programmers to implement complex, multithreaded applications without having to understand the underlying hardware mechanisms on a multiprocessor, which are platform-specific. Understanding the performance ramifications of shared-memory multiprocessors does require some knowledge of the hardware, however.

Let's look at the execution of a multithreaded program running on a multiprocessor. Each ready and waiting thread runs independently, acquiring a processor when one is free and the specific thread is the highest priority task in the Scheduler's Ready Queue. Threads from within a single process may need to communicate with each other from time to time and are likely to share access to common, global variables. While one thread is actively changing global values, it is necessary to block other threads from accessing them until after the modifications are complete. Imagine a program like the one illustrated in Figure 5-5 that operates on some common list structures and shares a copy of a common piece of code to access and update these shared data structures. Programmatically, it is necessary to serialize access to common data structures that can be referenced by multiple threads executing in parallel. Using serialization, one thread at a time gains exclusive access to these data structures in order to change them, potentially forcing other threads to wait if they require access to the data structures being changed. Serialization services are used to protect shared data structures so that modifications to shared data on a multiprocessor proceed one at a time. In Figure 5-5, Thread 1 obtains the lock to a critical section and proceeds to execute it, while Thread 2 blocks waiting for the lock.

Figure 5-5. Critical sections

Serialization services to ensure that only one thread at a time can access shared-memory locations are at the heart of Windows 2000, just like any multiprocessor operating system kernel. The precise mechanism for serialization is processor-specific, located within the HAL. The Intel instruction set provides a LOCK prefix, which indicates that the instruction that follows serializes across all processing engines. From a hardware perspective, the LOCK prefix wraps a shared-memory bus LOCK# signal around successive memory fetch and store commands. The LOCK prefix is designed to be coded along with an instruction like CMPXCHG (which works on a memory word) or BTS (which operates on a bit) to test and set a value in memory in a single instruction execution cycle. (The frequently used XCHG serializing instruction for operating on shared memory has an implicit LOCK associated with it.) Instructions coded with the LOCK prefix are guaranteed to run uninterrupted and gain exclusive access to the designated memory location for the duration of the operation.

In programming terms, serialization is commonly referred to as locking a particular section of code involved in updating shared data structures so that it can only be executed by one thread at a time. A shared code fragment that requires locking because it updates shared memory is called a critical section. To gain access to the code inside a critical section, a thread must first acquire the lock word associated with it. This is a designated memory location that, for example, contains a zero value if no thread is executing the code, and a nonzero value if some thread is.

A thread attempting to enter a critical section first tests the value of the lock, not a trivial matter when a multiprocessor is serving two or more threads independently. If the thread determines that the critical section is not locked, it is free to enter and execute the critical section of code. As the thread enters a critical section, it sets the lock word. The fact that the lock word is set acts as a signal to any other thread trying to gain access to the routine.

The LOCK prefix in front of a CMPXCHG instruction allows a program to test and set a lock value in memory in a single instruction. On a multiprocessor, this memory read and update is guaranteed to be atomic. That means the instruction runs uninterrupted and is immediately visible to all other processors with access to shared memory when it is executed. Another thread attempting to gain access to the locked critical section must wait until the lock is released.
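The semantics of an atomic test-and-set can be modeled as follows. This Python sketch only simulates what LOCK CMPXCHG guarantees in hardware, where the compare and the conditional store happen as one indivisible operation; in Python the two steps are not actually atomic:

```python
# Model of the test-and-set semantics LOCK CMPXCHG provides in hardware.

class LockWord:
    def __init__(self):
        self.value = 0          # 0 = unlocked, nonzero = locked

def compare_and_swap(word, expected, new):
    """If word.value == expected, store new and succeed; in hardware this
    whole sequence is a single uninterruptible instruction."""
    if word.value == expected:
        word.value = new
        return True             # this thread changed the lock word
    return False                # another thread holds it; caller must wait

lock = LockWord()
print(compare_and_swap(lock, 0, 1))  # True: the lock is acquired
print(compare_and_swap(lock, 0, 1))  # False: a second thread finds it held
compare_and_swap(lock, 1, 0)         # holder resets the lock word to release
```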

When a thread holding the lock has completed its update, it issues another locked XCHG instruction to reset the lock word and allow blocked threads to enter the protected code. What to do about a thread that terminates while it holds a lock is one of the big problems programmers developing multithreaded applications face. It is easy to imagine scenarios where a thread executing a critical section fails, leaving the lock set. Surviving threads remain blocked and the entire process hangs. This is not a pretty sight.
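One common defensive pattern, sketched here in Python with an ordinary lock rather than a Win32 object, is to release the lock in a finally block so that a thread failing inside the critical section does not leave the lock set:

```python
# Always release a held lock on the way out of the critical section, even
# when the protected code fails with an exception.
import threading

lock = threading.Lock()

def update_shared_list(shared, item, fail=False):
    lock.acquire()
    try:
        if fail:
            raise RuntimeError("thread fails inside the critical section")
        shared.append(item)
    finally:
        lock.release()          # runs even when the update raises

shared = []
update_shared_list(shared, "a")
try:
    update_shared_list(shared, "b", fail=True)
except RuntimeError:
    pass
print(lock.locked())            # False: surviving threads are not blocked
```

Note that this guards only against failures the thread itself can observe, such as exceptions; a thread that is terminated outright still needs operating system help, which is where the mutex semantics discussed next come in.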

Windows 2000 serialization services

The Windows 2000 HAL abstracts several serialization primitives that rely on the LOCK prefix, coded with a suitable instruction like CMPXCHG or XCHG. Programs running under Windows 2000 that need to lock data structures call OS system services to perform serialization on their behalf, so that programmers do not need to understand the underlying processor hardware details. Windows 2000 supports three Win32 serialization objects: critical sections (called just "sections" in System Monitor), semaphores, and mutexes. Using System Monitor, you can keep track of the number of sections, semaphores, and mutexes that have been created by looking at the corresponding counters in the Object object. Figure 5-6 illustrates the synchronization objects created when we launched Internet Explorer, Excel, Word, and Outlook from a desktop PC running Windows 2000 Professional. If you are writing your own applications that perform sophisticated multithreading, these counters may be of interest. Otherwise, you are unlikely to ever need to monitor them.

Figure 5-6. Monitoring sections, semaphores, and mutexes using Performance Monitor

The Windows 2000 serialization services make it relatively easy for a programmer to write code that will execute correctly on a multiprocessor. Not only does the programmer not have to deal with the underlying hardware, but the code is also portable across different hardware environments. A critical section object in Windows 2000 can be used to synchronize access to a shared code segment among multiple threads from a single process. A mutex (an abbreviation for mutual exclusion) object is a similar construct that can be used to synchronize threads from different processes. Mutexes are owned by threads and support behavior to signal any remaining threads that an owning thread has terminated without releasing the mutex; Windows 2000 calls a mutex held by a terminated thread abandoned. A stranded thread waiting on an abandoned mutex is released with a special return code, allowing your application to recover from this sort of catastrophic failure, which would otherwise cause it to hang. Finally, counting semaphores implement slightly more complicated semantics: they can track, for example, multiple reader threads sharing access to data while a writer thread that needs exclusive control of the critical section waits until the resource is totally free.
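The counting behavior can be illustrated with a short Python sketch, using the standard threading module rather than the Win32 objects themselves: up to a fixed number of readers may hold the resource at once, and any further acquirer must wait until a permit is released.

```python
# A counting semaphore with three permits: three readers may share the
# resource; a fourth acquirer (here standing in for the writer) must wait.
import threading

MAX_READERS = 3
sem = threading.BoundedSemaphore(MAX_READERS)

# Three readers enter; the semaphore's internal count drops to zero.
for _ in range(MAX_READERS):
    sem.acquire()

# Another acquire fails immediately: the resource is fully in use.
writer_gets_in = sem.acquire(blocking=False)
print(writer_gets_in)               # False: readers are still active

# Readers leave one at a time; permits become available again.
for _ in range(MAX_READERS):
    sem.release()
print(sem.acquire(blocking=False))  # True: the first permit is reclaimed
```

A writer that needs truly exclusive control would drain all the permits before proceeding, which is why it waits until the resource is completely free.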

While the serialization services Windows 2000 provides make it easier to program multithreaded applications that execute correctly on a multiprocessor, building such applications is never easy. It is very difficult for most programmers to deal with multiple threads executing at the same time, and independent threads can interact in such complex ways that it is hard for even experienced programmers to anticipate them in advance.

Shared-Memory Multiprocessor Scalability

Shared-memory multiprocessors running SMP operating systems are the most common breed of multiprocessor. A great deal is known about hardware performance in this context. As discussed previously, any multithreaded application will likely benefit from having a choice of processors to run on, and in an SMP any thread can run on any available processor. Even single-threaded applications may benefit, since multiple applications can run in parallel. Because a dual processor system running Windows 2000 can dispatch two threads at a time, not just one, it seems reasonable to assume the dual processor configuration is twice as powerful as having a single engine to do the work. To see why this isn't exactly so, we will need to investigate a few aspects of shared-memory multiprocessor scalability.

Shared-memory multiprocessors have some well-known scalability limitations. They seldom provide perfect linear scalability. Each time you add a processor to the system, you do get a corresponding boost in overall performance, but with each additional processor the boost you get tends to diminish. It is even possible to reach a point of diminishing returns, where adding another processor actually reduces overall capacity. In this section we discuss several fundamental issues that impact multiprocessing scalability, including contention for the shared memory bus, the cost of maintaining cache coherence, and delays due to serialization and locking.

Understanding these sources of performance degradation in a shared-memory multiprocessor will help us in examining and interpreting the measurement data available on a Windows 2000 multiprocessor.

One thing to be careful about is that the processors in an SMP may look busy, but if you look inside the processors, you will find they are not performing as much productive work as you thought. The way to look inside, of course, is to use the Pentium counters. In the last chapter, we saw that the internal measurements of instruction execution rate (IER) generally track Processor % Processor Time very well on a single engine system. However, on a multiprocessor, internal IER and external processor busy measures may not correspond in the same expected way.

Figure 5-7, taken from a two-way multiprocessor, illustrates this essential point. The screen on the left shows the Task Manager histogram of processor utilization on this machine. Notice that both engines are running at near 100% utilization. This configuration contains two 200 MHz Pentium Pro processors. Remember, on a uniprocessor, a good rule of thumb is to expect performance at or near two machine cycles per instruction (CPI). This translates into capacity of about 100,000,000 instructions per second on a 200 MHz P6. The screen on the right shows actual measurements for one of these engines at only 12,345,000 instructions per second, or a CPI of about 16.2. This machine is delivering only about 12% of its rated performance--notice the more than 146 million resource-related stalls, too. Making sure that an expensive four- or eight-way multiprocessor server configuration is performing up to its capacity is not a trivial affair.
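The arithmetic behind these numbers is worth spelling out. A short Python sketch, using the figures quoted above:

```python
# Rated capacity assumes the rule-of-thumb CPI of 2; the measured instruction
# execution rate (IER) comes from the Pentium counters.
CLOCK_HZ = 200_000_000              # 200 MHz Pentium Pro
RATED_CPI = 2.0                     # rule-of-thumb cycles per instruction
measured_ier = 12_345_000           # measured instructions per second

rated_ier = CLOCK_HZ / RATED_CPI    # 100,000,000 instructions per second
actual_cpi = CLOCK_HZ / measured_ier
efficiency = measured_ier / rated_ier

print(round(actual_cpi, 1))         # 16.2 cycles per instruction
print(round(efficiency * 100))      # 12 (percent of rated capacity)
```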

Figure 5-7. Multiprocessor scalability issues force you to look inside the processor

Speed-up factors

When we talk about scalability in the context of either multiprocessors or parallel processors, we are referring to our basic desire to harness the power of more than one processor to solve a common problem. The goal of a dual processor design is to apply double the CPU horsepower to a single problem and solve it in half the time. The goal of a quad processor is to apply quadruple the processing power and solve problems in one quarter the time. The term speed-up factor refers to our expectation that a multiprocessor design will improve the amount of time it takes to process some multithreaded workload. If a multiprocessing design achieved a speed-up factor of 1, then two processors would provide fully twice the power of a single engine. This would be perfect linear scalability, something shared-memory multiprocessors are just not capable of. A reasonable expectation is a speed-up factor of about 0.85. This means that two Intel processors tied together would be able to function at only about 85% efficiency. Together, the two would provide 1.7 times the power of a standalone processor. An improvement, but certainly a more marginal and less cost-effective one than the ideal. It turns out that Windows 2000 running on Intel hardware provides MP scalability in that range.

What happens when you add a third, fourth, or even more processors? The best models of shared memory multiprocessor performance suggest that the machines get progressively less efficient as you add more processors to the shared bus. However, with proper care and feeding, multiprocessor configurations with quite good scalability can be configured, although this requires skill and effort. We explore the configuration and tuning of multiprocessing workloads later in this chapter.

Microsoft reported in 1996 that the symmetric multiprocessing support built for Windows NT Version 4.0 sported a speed-up factor of 0.85. Figure 5-8 compares the theoretical prospects for linear speed-up in a multiprocessor design to the actual (projected) scalability of Windows NT Version 4.0 based on the measurements reported by Microsoft. The projection used here is a guess, of course, and your mileage may vary, but it is based on the formula Gunther[1] recommends for predicting multiprocessor scalability, and is consistent with a number of published benchmarking results.[2] Actual performance is very workload-dependent, as we discuss shortly. Figure 5-8 illustrates that actual performance of a multiprocessor running Windows NT falls far short of the ideal linear speed-up. In fact, beyond four multiprocessors, the projection is that adding more engines hardly boosts performance at all. Many published benchmark results of Windows NT multiprocessors evidence similar behavior. Moreover, after a certain point (more than 12 processors), adding additional engines actually degrades overall performance. The Windows NT multiprocessor scalability is typical of general-purpose operating systems--no worse, no better. To achieve anywhere near linear scalability requires highly engineered, special-purpose parallel processing hardware and complementary operating system services to match. We review some of the factors impacting shared-memory multiprocessor scalability in more detail later.

Figure 5-8. Theoretical linear scalability compared to actual projected scalability
Fig 8
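The flavor of the projection used in Figure 5-8 can be sketched in a few lines of C. The model below is Gunther's super-serial scalability formula; the parameter values are illustrative guesses chosen to reproduce roughly the 0.85 two-way efficiency and the degradation beyond 12 processors described above, not Microsoft's actual measurements:

```c
#include <stdio.h>

/* A sketch of Gunther's super-serial scalability model:
 *   speedup(p) = p / (1 + sigma*(p - 1) + kappa*p*(p - 1))
 * sigma models serialization (contention for shared resources),
 * kappa models the added coherency delay between processors.
 * The parameter values used below are illustrative only. */
double superserial_speedup(int p, double sigma, double kappa)
{
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1));
}

void print_speedup_table(void)
{
    /* sigma = 0.15, kappa = 0.005 yield roughly 1.7x at p = 2
     * and a peak near p = 13, after which more engines hurt */
    for (int p = 1; p <= 16; p++)
        printf("%2d processors: %.2f times uniprocessor throughput\n",
               p, superserial_speedup(p, 0.15, 0.005));
}
```

With these hypothetical parameters, two processors deliver about 1.72 times uniprocessor throughput, and projected throughput actually declines past thirteen processors, consistent with the curve's shape in Figure 5-8.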

Windows 2000 incorporates some further enhancements designed to improve multiprocessor scalability. Microsoft implemented a new HAL function called queued spin locks that exploits a new Intel instruction on the Pentium III. It is not clear just how much this new function will help on large-scale 8- and 16-way machines. Figure 5-8 suggests two possibilities, both reflecting a relatively marginal increase in multiprocessor scalability to 0.90 or possibly even 0.95.

To summarize, it is simply not possible to string processor after processor together and double, triple, quadruple, etc., the amount of total processing power available. The principal obstacle in shared-memory designs, which are quite simple from the programmer's standpoint, is the bottleneck that typically develops in accessing shared-memory locations over the shared system memory bus. To understand the nature of this bottleneck, let's proceed to a discussion of the sources of performance degradation in a multiprocessor. We now look at the performance impact of serializing instructions, as well as other multiprocessor cache effects, including the extra cycles consumed by spin lock code.

Serializing instructions

The first noticeable multiprocessor effect is the performance impact of serializing LOCKed instructions. Instructions coded with the LOCK prefix are guaranteed to run uninterrupted and gain exclusive access to the designated memory locations. Locking the shared-memory bus delays any threads executing on other processors that need access to memory locations not currently resident in cache. In addition, there are a number of hardware-oriented operations performed by the operating system that implicitly serialize by locking the shared-memory bus on an Intel shared-memory multiprocessor. These include setting the active task state segment (TSS), which is performed during a context switch of any type. Intel hardware also automatically serializes updates of the Page Directory Entries and Page Table Entries that are used in translating virtual memory addresses to real memory locations. (This impacts the page replacement algorithm that Windows 2000 uses on Intel multiprocessors, as we will see in the next chapter.)

Intel documentation (see the Intel Architecture Software Developer's Guide: Volume 3, System Programming Guide, especially Chapter 7) describes some specific serializing instructions that force the processor executing them to drain the entire instruction execution pipeline before executing the instruction. Following execution of the serializing instruction, the pipeline is started up again. These serializing instructions include privileged operations that move values into internal Control and Debug Registers, for example. Serializing instructions also have the effect on the P6 of forcing the processor to re-execute out-of-order instructions. The performance impact of draining the instruction execution pipeline and re-executing micro-ops ought to be obvious.

As discussed in the previous chapter, current-generation Pentium and Pentium Pro Intel processors are pipelined, superscalar architectures. The performance impact of executing an instruction serialized with the LOCK prefix includes potentially stalling the pipelines of other processors until the instruction that requires serialization frees up the shared-memory bus. As you can imagine, this can be a fairly big performance hit, too, which is solely a consequence of running your program in a multiprocessor environment. The cost of both sorts of instruction serialization contributes to at least some of the less-than-linear scalability that we can expect in a multiprocessor. This cost is very difficult to quantify, and certainly workload-dependent. Moreover, there is very little one can do about this source of degradation. Without serializing instructions, multiple processors would simply not work reliably.
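The effect of a LOCKed read-modify-write can be illustrated with C11 atomics, which current compilers implement on x86 with LOCK-prefixed instructions (a LOCKed XADD in this sketch). This is our illustration, not code from the text:

```c
#include <stdatomic.h>

/* Sketch: a serialized read-modify-write. On x86, compilers
 * implement atomic_fetch_add with a LOCK-prefixed XADD, which
 * gains exclusive access to the target memory location (or its
 * cache line) for the duration of the read-modify-write. */
long locked_add(atomic_long *counter, long n)
{
    return atomic_fetch_add(counter, n);  /* returns the prior value */
}

/* Single-threaded demonstration of the semantics: the prior
 * value is returned, and the update is applied atomically. */
long locked_add_demo(void)
{
    atomic_long counter = 5;
    long prior = locked_add(&counter, 3);          /* prior == 5 */
    return prior * 100 + atomic_load(&counter);    /* 5*100 + 8  */
}
```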

A final source of multiprocessor interference is interprocessor signaling instructions. These are instructions issued on one processor to signal another processor, for example, to wake it up to process a pending interrupt. Interprocessor signaling is quite expensive in performance terms.

Cache effects

In Chapter 4, we discussed how essential on-board CPU caching is to the performance of pipelined processors. Intel waited to introduce pipelining until there was enough real estate available on its 486 chips to incorporate an on-board cache. It should not be a big surprise to learn that one secondary effect of multiprocessor coordination and serialization is that it makes caching less effective. This, in turn, reduces the processor's instruction execution rate. To understand why SMPs impact cache effectiveness, we soon take a detour into the realm of cache coherence. From a configuration and tuning perspective, one intended effect of setting up an application to run with processor affinity is to improve cache effectiveness and increase the instruction execution rate. Direct measurements of both instruction execution rate and caching efficiency, fortunately, are available via the Pentium counters. A freeware tool called CPUmon, available at http://www.sysinternals.com/ntw2k/freeware/cpumon.shtml, supports the Pentium counters under Windows 2000.[3]

Spin locks

If two threads are attempting to access the same serializable resource, one thread will acquire the lock, which then blocks the other one until the lock is released. But what should the thread that is blocked waiting on a critical section do while it is waiting? An application program in Windows 2000 is expected to use a Win32 serialization service that puts the application to sleep until notified that the lock is available. Win32 serialization services arrange multiple threads waiting on a shared resource in a FIFO queue so that the queueing discipline is fair. This suggests that a key element of designing an application to run well on a shared-memory multiprocessor is to minimize the amount of processing time spent inside critical sections. The shorter the time spent executing inside a locked critical section of code, the less time other threads are blocked waiting to enter it. Much of the re-engineering work Microsoft did on NT 4.0 and again in Windows 2000 was to redesign the critical sections internal to the OS to minimize the amount of time kernel threads would have to wait for shared resources.

If critical sections are designed appropriately, then threads waiting on a locked critical section should not have long to wait! Moreover, while a thread is waiting on a lock, there may be nothing else for it to do. For example, a thread waiting on the Windows 2000 Scheduler lock to acquire a ready thread to dispatch can perform no useful work until it has successfully acquired that lock. So consider a blocked kernel thread waiting on a lock. The resource the thread is waiting for is required before any other useful work on the processor can be performed, and the wait can be expected to be of very short duration. Under these circumstances, the best thing to do may be to loop back and test for the availability of the lock again. We have already seen an example of code that tests for availability of a lock, then enters a critical section by setting the lock using a serializing instruction. If the same code finds the lock already set and branches back to retest the lock, this code implements a spin lock. If you were able to watch this code's execution, it would appear to be spinning back and forth within a very tight loop of just a few instructions.
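The spin lock just described can be sketched in portable C11 atomics. A real Windows 2000 kernel spin lock also manipulates IRQL; that detail is omitted in this illustration:

```c
#include <stdatomic.h>

/* A minimal test-and-set spin lock, as described in the text.
 * On x86, atomic_flag_test_and_set compiles to a LOCKed
 * XCHG-style operation; the while loop is the "spin." */
typedef struct { atomic_flag locked; } spinlock;

void spin_init(spinlock *l)    { atomic_flag_clear(&l->locked); }

void spin_acquire(spinlock *l)
{
    /* test and set; if the lock was already held, branch back
     * and retest -- this tight loop is the spin */
    while (atomic_flag_test_and_set(&l->locked))
        ;
}

void spin_release(spinlock *l) { atomic_flag_clear(&l->locked); }

/* Demonstration: after acquire, a second test-and-set sees the
 * lock held (returns nonzero); after release it can be retaken. */
int spin_demo(void)
{
    spinlock l;
    spin_init(&l);
    spin_acquire(&l);
    int was_held = atomic_flag_test_and_set(&l.locked);  /* lock is held */
    spin_release(&l);
    spin_acquire(&l);   /* succeeds immediately after release */
    spin_release(&l);
    return was_held;
}
```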

Spin locks are used in many, many different places throughout the operating system in Windows 2000, since operating system code waiting for a critical section to be unlocked often has nothing better to do during the (hopefully) very short waiting period. In addition, device drivers are required to use spin locks to protect data structures across multiple processors. Windows 2000 provides a standard set of spin lock services for device drivers to use in ISRs and kernel threads outside of ISRs. (See the DDK documentation on, for example, KeInitializeSpinLock, IoAcquireCancelSpinLock, and related services for more detail.) These standard services allow device drivers written in C language to be portable across versions of Windows NT running on different hardware. In Windows 2000, where these portability concerns are no longer present, these ready-made HAL runtime services still serve to insulate C language programmers from the intricacies of Intel's multiprocessing support.

In Windows NT Version 4.0, you can use the Thunk function in the x86 Perf Meter application in the Resource Kit (pperf.exe--the same application used to access the Pentium counters) to observe spin lock activity. For example, from the Thunk menu, select the ntfs.sys filesystem driver module, using hal.dll as the target module. Then select the KfAcquireSpinLock and KfReleaseSpinLock for monitoring, as illustrated in Figure 5-9. If you then generate some ntfs file metadata activity, like emptying the Recycle Bin, you will observe ntfs driver code using the HAL spin lock function to protect critical sections of code. Now consider a 2-, 4-, or 8-way multiprocessor with an ntfs filesystem. ntfs.sys functions can be executed on any processor where there is an executing thread that needs access to disk files. In fact, in multiprocessors it is likely that ntfs functions will execute concurrently (on different processors) from time to time. As Figure 5-9 illustrates, ntfs.sys uses HAL spin lock functions to protect critical sections of code, preserving the integrity of the filesystem in a multiprocessor environment.

Figure 5-9. The Thunk function can be used to monitor spin lock activity
Fig 9

Spin lock code is effectively dormant when run on a single-processor system, but can consume (or waste, depending on your point of view) a significant number of processor cycles on a multiprocessor. Again, the use of spin locks is, to a large degree, unavoidable. The performance implication of spin locks is that processor utilization increases, but no useful work is actually being performed. In this way, simple measures of processor utilization can be misleading. Windows 2000 client applications care about throughput and response time, which may be degraded on a multiprocessor even as measurements of CPU utilization look rosy.

The combined impact of serializing instructions, interprocessor signaling, diminished cache effectiveness, and the consumption of processor cycles by spin lock code limits the scalability of shared-memory multiprocessors in Windows 2000 and other shared-memory multiprocessing operating systems. Furthermore, each additional processor added to the configuration amplifies these scalability factors. These factors make sizing, configuring, and tuning large-scale n-way Windows 2000 multiprocessors a very tricky business. Just how tricky should become more apparent after considering the cache coherence problem in the next section.

Cache Coherence

The cache effects of running on a shared-memory multiprocessor are probably the most salient factors limiting the scalability of this type of computer architecture. The various forms of processor cache, including Translation Lookaside Buffers (TLBs), code and data caches, and branch prediction tables, all play a critical role in the performance of pipelined machines like the Pentium, Pentium Pro, Pentium II, and Pentium III. For the sake of performance, in a multiprocessor configuration each CPU retains its own private cache memory, as depicted in Figure 5-10. We have seen that multiple threads executing inside the Windows 2000 kernel or running device driver code concurrently can attempt to access the same memory locations. Propagating changes to the contents of memory locations cached locally to other engines with their own private copies of the same shared-memory locations is a major issue, known as the cache coherence problem in shared-memory multiprocessors. Cache coherence issues also have significant performance ramifications.

Maintaining cache coherence in a shared-memory multiprocessor is absolutely necessary for programs to execute correctly. While independent program execution threads operate independently of each other for the most part, sometimes they must interact. Whenever they read and write common or shared-memory data structures, threads must communicate and coordinate accesses to these memory locations. This coordination inevitably has performance consequences. We illustrate this side effect by drawing on the example discussed earlier, where two kernel threads are attempting to gain access to the Windows 2000 Scheduler Ready Queue simultaneously. A global data structure like the Ready Queue that may be accessed by multiple threads executing concurrently on different processors must be protected by a lock on a multiprocessor. Let's look at how a lock word value set by one thread on one processor is propagated to cache memory in another processor where another thread is attempting to gain access to the same critical section.

In Figure 5-10, Thread 0, running on CPU 0, has just finished updating the Windows 2000 Scheduler Ready Queue and is about to exit a critical section. Upon exiting the critical section of code, Thread 0 resets the lock word at location mem1 using a serializing instruction like XCHG. Instead of locking the bus during the execution of the XCHG instruction, the Intel P6 operates only on the cache line that contains mem1. This is done to boost performance: the locked memory fetch and store that the instruction otherwise requires would stall the CPU 0 pipeline. In the Intel architecture, if the operand of a serializing instruction like XCHG is resident in processor cache in a multiprocessor configuration, the P6 does not lock the shared-memory bus. This is a form of deferred write-back caching, which is very efficient. Not only does the processor cache hardware use this approach to caching frequently accessed instructions and data, but so do Windows 2000 systems software and hardware-cached disk controllers, for example.

In the interest of program correctness, updates made to private cache, which are deferred, ultimately must be applied to the appropriate shared-memory locations before any threads running on other processors attempt to access the same information. Moreover, as Figure 5-10 illustrates, there is an additional data integrity exposure because another CPU can (and frequently does) have the same mem1 location resident in cache. The diagram illustrates a second thread that is in a spin loop trying to enter the same critical section. This code continuously tests the contents of the lock word at mem1 until it is successful. For the sake of performance, the XCHG instruction running on CPU 1 also operates only on the cache line that contains mem1 and does not attempt to lock the bus because that would probably stall the other processor's instruction execution pipeline. We can see that unless there is some way to let CPU 1 know that code running on CPU 0 has changed the contents of mem1, the code on CPU 1 will spin in this loop forever. The Intel P6 processors solve this cache coherence problem using a method conventionally called snooping.

Figure 5-10. The cache coherence problem
Fig 10

MESI Snooping Protocol

Snooping protocols to maintain cache coherence have each processor listening to the shared-memory bus for changes in the status of cache resident addresses that other processors are operating on concurrently. Snooping requires that processors place the memory addresses of any shared cache lines being updated on the memory bus. All processors listen on the memory bus for memory references made by other processors that affect memory locations that are resident in their private cache; thus the term snooping. The term snooping also has the connotation that this method for keeping every processor's private cache memory synchronized can be performed in the background (which it is) without a major performance hit (which is true, but only up to a point). In practice, maintaining cache coherence is a complex process that interferes substantially with normal pipelined instruction execution and generates some serious scalability issues.

Let's illustrate how the Intel snooping protocol works, continuing with our Ready Queue lock word example. CPU 1, snooping on the bus, recognizes that the update to the mem1 address performed by CPU 0 invalidates its cache line containing mem1. Then, because the cache line containing mem1 is marked invalid, CPU 1 is forced to refetch mem1 from memory the very next time it attempts to execute the XCHG instruction inside the spin lock code. Of course, at this point CPU 0 has still not yet updated mem1 in memory. But snooping on the shared-memory bus, it discovers that CPU 1 is attempting to read the current value of mem1 from memory. CPU 0 intercepts and delays the request. Then CPU 0 writes the cache line containing mem1 back to memory. Then, and only then, is CPU 1 allowed to continue refreshing the corresponding line in its private cache and update the lock word.

The cache coherence protocol used in the Intel architecture is denoted MESI, which corresponds to the four states of each line in processor cache: modified, exclusive, shared, or invalid (see Table 5-1). At any one time, a line in cache is in one and only one of these four states.

Table 5-1: The states of the MESI cache coherence protocol

State       Description
---------   -----------------------------------------------------------------
Modified    Valid line, modified; guaranteed that this line exists only in this cache; the corresponding memory line is stale
Exclusive   Valid line, unmodified; guaranteed that this line exists only in this cache
Shared      Valid line, unmodified; line also exists in at least one other cache
Invalid     An invalid line that must be refreshed from memory

The MESI protocol very rigidly defines what actions each processor in a multiprocessor configuration must take based on the state of a line of cache and the attempt by another processor to act on the same data. The previous scenario illustrates just one set of circumstances that the MESI protocol is designed to handle. Let's revisit this example using the Intel MESI terminology.

Suppose that Thread 1 running in a spin lock on CPU 1 starts by testing the lock word at location mem1. The 32 bytes containing this memory location are brought into the cache and flagged exclusive because they are currently contained only in the CPU 1 cache. Meanwhile, when CPU 0 executes the first part of the XCHG instruction on mem1 designed to reset the lock, the 32 bytes containing this memory location are brought into the CPU 0 cache. CPU 1, snooping on the bus, detects CPU 0's interest in a line of cache that is currently marked exclusive and transitions this line from exclusive to shared. CPU 1 signals CPU 0 that it too has this line of memory in cache, so CPU 0 marks the line shared, too. The second part of the XCHG instruction updates mem1 in CPU 0 cache. The cache line resident in CPU 0 transitions from shared to modified as a result. Meanwhile, CPU 1, snooping on the bus, flags its corresponding cache line as invalid. Subsequent execution of the XCHG instruction within the original spin lock code executing on CPU 1 to acquire the lock finds the cache line invalid. CPU 1 then attempts to refresh the cache line from memory, locking the bus in the process to ensure coherent execution of all programs. CPU 0, snooping on the bus, blocks the memory fetch by CPU 1 because the state of that memory location in CPU 0 cache is modified. CPU 0 then writes the contents of this line of cache back to memory, reflecting the current data in CPU 0's cache. At this point, CPU 1's request for memory is honored, and the now current 32 bytes containing mem1 are brought into the CPU 1 cache. At the end of this sequence, both CPU 0 and CPU 1 have valid data in cache, with both lines in the shared state.
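The line-state transitions in this walkthrough can be captured in a toy state machine. This sketch models only the state of a single line in one processor's cache in response to local accesses and snooped bus events; it is our illustration of the protocol's logic, not Intel's implementation:

```c
#include <stdbool.h>

/* Toy model of MESI transitions for one cache line in one
 * processor's cache, driven by local accesses and by events
 * observed (snooped) on the shared-memory bus. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* this processor reads the line */
mesi_state local_read(mesi_state s, bool another_cache_has_line)
{
    if (s == INVALID)
        return another_cache_has_line ? SHARED : EXCLUSIVE;
    return s;  /* M, E, and S all satisfy a read locally */
}

/* this processor writes the line (gaining ownership first) */
mesi_state local_write(mesi_state s)
{
    (void)s;
    return MODIFIED;
}

/* snooped: another processor reads this line */
mesi_state snooped_read(mesi_state s)
{
    /* a MODIFIED line is written back to memory first, then shared */
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;
    return s;
}

/* snooped: another processor writes this line */
mesi_state snooped_write(mesi_state s)
{
    (void)s;
    return INVALID;  /* our copy is now stale */
}
```

Replaying the scenario above: CPU 1's first read takes the line to exclusive; CPU 0's fetch moves both copies to shared; CPU 0's XCHG update makes its line modified while invalidating CPU 1's; and CPU 1's refetch, after CPU 0's write-back, leaves both lines shared.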

The MESI protocol ensures that cache memory in the various independently executing processors is consistent no matter what the other processors are doing. Clearly, what is happening in one processor can interfere with the instruction execution stream running on the other. With multiple threads accessing shared-memory locations, there is no avoiding this. These operations on shared memory stall the pipelines of the processors affected. For example, when CPU 0 snoops on the bus and finds another processor attempting to fetch a line of cache from memory that is resident in its private cache in a modified state, then whatever instructions CPU 0 is attempting to execute in its pipeline are suspended. Writing back modified data from cache to memory takes precedence because another processor is waiting. Similarly, CPU 1 running its spin lock code must update the state of that shared line of cache when CPU 0 resets the lock word. Once the line of cache containing the lock word is marked invalid on CPU 1, the serializing instruction issued on CPU 1 stalls the pipeline because cache must be refreshed from memory. The pipeline is stalled until CPU 0 can update memory and allow the memory fetch operation to proceed. All of this results in time-consuming delays that reduce the CPU instruction execution rate on both processors.

One not-so-obvious performance implication of snooping protocols is that they utilize the shared-memory bus heavily. Every time an instruction executing on one processor needs to fetch a new value from memory or update an existing one, it must place the designated memory address on the shared bus. The bus itself is a shared resource. With more and more processors executing, the bus tends to get quite busy. When the bus is in use, other processors must wait. Utilization of the shared-memory bus is likely to be the most serious bottleneck impacting scalability in multiprocessor configurations of three, four, or more processing engines.

Pentium Pro Hardware Counters

The measurement facility in the Intel P6 or Pentium Pro processors (including the Pentium II, Pentium III, and Pentium 4) was strengthened to help hardware designers cope with the demands of more complicated multiprocessor designs. By installing the CPUmon freeware utility available for download at http://www.sysinternals.com, system administrators and performance analysts can access these hardware measurements. While the use of these counters presumes a thorough understanding of Intel multiprocessing hardware, we hope that the previous discussion of multiprocessor design and performance has given you the confidence to start using them to diagnose specific performance problems associated with large-scale Windows 2000 multiprocessors. The P6 counters provide valuable insight into multiprocessor performance, including direct measurement of the processor instruction rate, Level 2 cache, TLB, branch prediction, and the all-important shared-memory bus.

CPUmon, a freeware Pentium counter utility, lets you enable and then disable the Pentium counters for a specific interval, as illustrated in Figure 5-11. The CPUmon user interface, which separates the various counters that are available into different classes, is user-friendly. (CPUmon assigns the counters different names from the ones the x86 Perf Meter application does, which is confusing. Even more confusing, some of the available P6 counters are missing entirely from the program.) Once the Pentium counters you select are enabled, Version 2 of CPUmon provides a Perflib DLL that lets you view them continuously using System Monitor, as illustrated in Figure 5-12.

Figure 5-11. The CPUmon freeware utility is used to access the Pentium counters
Fig 11

Figure 5-12. CPUmon lets you view Pentium counter measurements using System Monitor
Fig 12

Shared-Memory Bus Utilization Measurements

The shared-memory bus measurements can often shed the most light on multiprocessor performance. Table 5-2 lists the various P6 bus measurement counters, using the Microsoft counter names from the NT 4.0 Reskit's counters.hlp. The first observation we make looking at these counter names and their (unilluminating) Explain Text is that many of them are arcane and esoteric. To understand what "Bus DRDY asserted clocks/second" means might send us scurrying to the Intel architecture manuals, which, unfortunately, are of little use. A second observation, which may be triggered by viewing the counters under controlled conditions, is that some of them probably do not mean what they appear to. For example, the Bus LOCK asserted clocks/sec counter consistently appears to be zero on both uniprocessor and multiprocessor configurations. Not much help there. The shared-memory bus is driven at the processor clock rate, and some counter names use the term cycles while others use the term clocks. The two terms are interchangeable. Although not explicitly indicated, some counters that mention neither clocks nor cycles are also measured in clocks. For example, an especially useful measure is Bus requests outstanding/sec, which measures the total number of clocks the bus is busy.

Table 5-2: P6 shared-memory bus hardware measurements

Bus all transactions/sec              Bus invalidate transactions/sec
Bus BNR pin drive cycles/sec          Bus IO transactions/sec
Bus burst read transactions/sec       Bus LOCK asserted clocks/sec
Bus burst transactions (total)/sec    Bus memory transactions (total)/sec
Bus clocks receiving data/sec         Bus partial transactions/sec
Bus CPU drives HIT cycles/sec         Bus partial write transactions/sec
Bus CPU drives HITM cycles/sec        Bus read for ownership trans/sec
Bus deferred transactions/sec         Bus requests outstanding/sec
Bus DRDY asserted clocks/sec          Bus snoop stalled cycles/sec
Bus instruction fetches/sec           Bus writeback transactions/sec

Bus memory transactions and Bus all transactions measure the number of bus requests. The bus measurements are not processor-specific, since the memory bus is a shared component. If you run the CPUmon freeware utility, you can see all processors report nearly identical values for BUS_TRAN_ANY, while, as expected, the number of instructions retired is processor-specific (see Figure 5-13). The memory bus that the processors share is a single resource, subject to the usual queuing delays. We will derive a measure of bus queuing delay in a moment.

Figure 5-13. CPUmon results for two separate processors
Fig 13

Now, let's look at some more P6 measurement data from a multiprocessor system. A good place to start is with Bus all transactions/sec, which, as noted, is the total number of bus requests. Figure 5-14 shows that when the bus is busy, it is usually due to memory accesses. Bus memory transactions/sec represent over 99% of all bus transactions. The measurement data is consistent with our discussion suggesting that bus utilization is often the bottleneck in shared-memory multiprocessors that utilize snooping protocols to maintain cache coherence. Every time any processor attempts to access main memory, it must first gain access to the shared bus.

Figure 5-14. Memory transactions often represent over 99% of all bus transactions
Fig 14

Larger Level 2 caches help reduce bus traffic, but there are diminishing returns from caches that are, in effect, too big. Each time a memory location is fetched directly from a Level 1 or Level 2 cache, it is necessary to broadcast the address on the bus. At some point, larger caches do not result in significant improvements in the rate of cache hits, yet they increase the management overhead necessary to maintain cache coherence. In this regard, both the rate of Level 2 cache misses and the number of write-back memory transactions are relevant, as both actions drive bus utilization (see Figure 5-15). The P6 Level 2 cache performance measurements are especially useful in this context for evaluating different processor configurations from Intel and other vendors that have different amounts of Level 2 cache. By accessing this measurement data, you can assess the benefits of different configuration options directly. This is always a better method than relying on some rule-of-thumb value proposed by this or that performance expert, perhaps based on a measurement taken running a benchmark workload that does not reflect your own.

Figure 5-15. Level 2 cache lines input (misses) as a function of bus memory transactions
Fig 15

The Bus snoop stalled cycles/sec counter has intrinsic interest on a multiprocessor. A high rate of stalls due to snooping is a direct indicator of multiprocessor contention. See Figure 5-16, which again was measured on a two-way multiprocessor. The number of snooping-induced stalls is low here. Even though as a percentage of the total resource stalls they are practically insignificant here, this is still a measurement that bears watching.

Figure 5-16. The number of stalls due to snooping is relatively small in this example
Fig 16

Next, consider the Bus requests outstanding counter, which is a direct measurement of bus utilization in clocks. By also monitoring Bus all transactions, you can derive a simple response time measure of bus transactions measured as the average clocks per transaction:

Average clocks per transactions = Bus requests outstanding ÷ Bus all transactions

Assuming that contention for the shared-memory bus is a factor, saturation of the bus on an n-way multiprocessor will likely drive up bus transaction response time, measured in clocks per transaction on average. Figure 5-17, a Perfmon screen shot, provides a uniprocessor baseline for this calculation. Since Perfmon cannot perform any arithmetic calculations, we exported the chart data to a file so that it could be processed in an Excel spreadsheet. Using Excel, we divided Bus requests outstanding by Bus all transactions to derive the average number of clocks per transaction. (Bear in mind that we can access only two counters at a time. Since memory transactions typically represent more than 99% of all bus transactions, it is safe to assume that clock cycles calculated using this formula genuinely do reflect the time it takes the processor to access memory.) The average clocks per bus transaction in this example generally fall in the range of 10-30 clocks, with an average of about 18 clocks per transaction. These calculations are summarized in the chart shown in Figure 5-18. In the case of a uniprocessor, the memory bus is a dedicated resource and there is no contention. Now compare the uniprocessor baseline in Figure 5-18 to the two-way multiprocessor in Figure 5-19. Here the average clocks per transaction is about 30, coming in at the high end of the uniprocessor range. The average number of clocks per bus transaction increases because of queuing delays accessing the shared-memory bus in the multiprocessor. In a multiprocessor, there is bus contention. By tracking these P6 counters, you can detect environments where adding more processors to the system will not speed up processing any because the shared-memory bus is already saturated. Bus contention tends to set an upper limit on the performance of a multiprocessor configuration, and the P6 counters let you see this.

Figure 5-17. Tracking P6 bus measurements on a uniprocessor using Performance Monitor.

Figure 5-18. Average clocks per bus transaction on a uniprocessor

Figure 5-19. Average clocks per bus transaction on a two-way multiprocessor

With the Pentium III, Intel introduced a new instruction called PAUSE, which reduces the bus contention that results from repeatedly executing spin lock code. The Windows 2000 HAL adds a new queued spin lock function that exploits the new hardware instruction where available. Device driver code trying to enter a critical section protected by a queued spin lock issues the PAUSE instruction inside its wait loop, referencing the lock word protecting that piece of code. PAUSE briefly idles the processor on each iteration of the loop until the lock word is changed by a thread running on another processor. At that point, the waiting processor resumes normal execution.

The queued spin lock mechanism was designed to eliminate the bus transactions that occur when spin lock code repeatedly tries to test and set a shared memory location. Instead of repeatedly executing code that tests the value of a shared lock word to see if it is safe to enter a critical section, each waiting processor in a queued spin lock spins on its own lock word, waiting quietly in its cache without generating bus transactions; the PAUSE instruction further reduces the power consumption and pipeline disruption caused by the spin loop itself. Since saturation of the shared-memory bus is an inherent problem in shared-memory multiprocessors, these innovations should improve the scalability of Windows 2000. At the same time, Intel designers also boosted the performance of the Pentium III system bus significantly, which should further improve multiprocessor scalability under Windows 2000.

Optimization and SMP Configuration Tuning

Having identified the major issues in multiprocessor scalability and performance, it is now time to turn our attention to what can be done about them. In this section we address some of the important performance issues that developers of scalable server applications must face when confronted with multiprocessors. Our discussion here is cursory, as we cannot do this complex subject justice in this brief space. Suffice it to say that building large, complicated application software for any platform is very difficult, and Windows 2000 is no exception. Hopefully, we will succeed in laying down some basic helpful guidelines.

Then, we review the steps that system administrators can take to get the most out of large, multiprocessor configurations. Unfortunately, very few applications provide tuning knobs that are useful in a multiprocessing environment beyond the ability to control the number of threads they create. Since only a multithreaded application can take advantage of multiple processors, being able to control the number of threads created by the application is something, at least. There are two important exceptions, however. There are tuning parameters to control threading, priority, and processor affinity that influence the performance of SQL Server running on a multiprocessor. We also look at a useful processor affinity setting for network interface cards. However, because most applications lack the appropriate tuning knobs, we then turn to a discussion of a third-party package that assists in multiprocessor configuration and tuning. The NCR SMP Utilization Manager add-on to Windows 2000 provides valuable assistance in configuring very large, n-way multiuser Windows 2000 servers.

Configuring Server Applications for Multiprocessing

The programming model used in Windows 2000 is designed around execution threads and processes, where threads are scheduled by the operating system, not by the process that created them. If you are writing a server application in Unix, a call to fork( ) or vfork( ) creates a child process that runs independently of the parent. The same application written for Windows 2000 would probably call CreateThread to create independent processing threads. (Of course, you can continue to use multiple processes in Windows 2000. The Win32 API call CreateProcess corresponds to the standard Unix fork call to spawn a child process.) Many applications ported to Windows 2000 from Unix betray their operating system heritage and use fork instead of CreateThread. This is partially because it is a big effort to rewrite code that already works. Moreover, developers trying to maintain some degree of cross-platform compatibility with their applications are understandably reluctant to introduce platform-specific optimizations. Nevertheless, to achieve the best results on Windows 2000, programmers should adopt Windows 2000's programming model that relies on multiple threads running within a process. Generally, wherever multiple processes are used in Unix, the Win32 programmer should use multiple threads.

Multithreaded Applications

A key advantage of threads is that they share the 4 GB virtual memory address space of the parent process (more about virtual memory management in Windows 2000 in Chapter 6), so they require less work for the operating system to create and destroy. Sharing an address space means that running multiple threads within the same process is also safer, since those threads can communicate using private virtual memory. This tends to be significantly safer than using shared memory for interprocess communication (IPC), which Windows 2000 also supports. A bug that fatally overruns an array or a list structure in private virtual memory damages the application only and is easily recovered by the OS. A similar bug in shared memory can damage other applications and, potentially, the operating system itself.

Having decided to use threads, the next question is, "How many threads does my application need?" There are no hard-and-fast rules for creating scalable server applications, but some practical performance guidelines can be discussed. Even though creating multiple threads entails less overhead than creating multiple processes, and there is no practical limit on the number of threads a server application can create, managing multiple threads does require some system overhead. For example, system and application memory is required for every thread that is created, so you should not create lots of unnecessary threads. But consider a server application that supports multiple clients. How many clients can your application support on a Windows 2000 server with a single 200 MHz processor and 64 MB of memory? How about on a Windows 2000 server with eight 400 MHz processors and 1 GB of memory? Not only does the second system use processing engines that are twice as fast, there are eight times as many of them, with plenty more memory to boot. You will likely need lots of threads for your server application to take advantage of all this extra computing capacity. For instance, unless it can run eight threads concurrently, the application cannot utilize all the processing resources that the bigger server has to offer. On the other hand, the right number of threads on the eight-way server will probably overwhelm the smaller-scale, one-processor server system.

Normally, because Windows 2000 systems scale across such a wide range of environments, you must establish useful defaults that scale with the number of processors, but ultimately let the system administrator override these defaults by providing a tuning knob. Consider, for example, the approach taken in Microsoft Internet Information Server (IIS) to setting performance and tuning options. The Performance tab in the WWW Service Master Properties dialog box is illustrated in Figure 5-20. The dialog box simply asks the system administrator to specify the number of HTTP hits per day expected: fewer than 10,000, fewer than 100,000, or more than 100,000. Presumably, IIS does the rest: allocating the right number of threads and the right amount of memory for caching HTML pages based on this setting.

Figure 5-20. IIS threading and memory usage is automatically adjusted using this dialog box

This kind of sweeping tuning control knob introduces some obvious problems. For instance, what is the best setting to use if you expect only 20,000 hits per day? Or what if you expect 2 million hits? This simplistic tuning knob also ignores basic issues involving workload characterization that are crucial to performance and tuning of this server application: how large are the HTML files, what is the distribution of their access, can they be cached effectively, etc. It also ignores the hardware environment: the processor, memory, disk, and networking resources available to process this workload. Happily, IIS, like many other Windows 2000 server applications, has several Registry settings designed to configure and control the performance of the application with considerably greater precision.

The first consideration for a server application that needs to scale across such a wide range of computing hardware is to create more worker threads on larger systems. One approach is to examine the hardware environment by looking at the Registry during initialization to see how many engines are available, and then create threads accordingly. Another is to provide an external parameter that permits an administrator to fine-tune the number of threads that your application creates. We have already seen a number of well-behaved Windows applications that have such controls, including the Windows 2000 File Server function in services.exe, MS SQL Server, MS Exchange, and even the Windows 2000 kernel itself.

Thread Pooling

A more dynamic method of scaling applications is to create a pool of worker threads, automatically increasing or decreasing the number of threads depending on the demand. This is how many server applications are built. Often, worker threads structured as a pool are dispatched from an internal queue as units of work arrive, as depicted in Figure 5-21. As work requests arrive, the application:

Figure 5-21. The structure of a thread pooling application

  1. Wakes up a Receiver thread, which
  2. Associates the unit of work with a work item on the available queue, then
  3. Wakes up a worker thread sleeping on the available thread queue to begin processing the request.
  4. If no worker threads are available, the Receiver thread queues the work item and signals a Control thread to create an additional worker thread.
  5. Meanwhile, any worker thread that completes its processing on behalf of some unit of work posts the Sender thread, and
  6. Checks the work queue prior to going back to sleep.

Usually, control-oriented Sender and Receiver threads are simple programs that manage the input and output queues, and the worker threads perform the bulk of the actual application-related processing. Increasing the number of worker threads created in the pool (and the number of available work items) is one way to make sure the application scales on multiprocessors. With all these different threads potentially executing concurrently on a multiprocessor, it is easy to see how complicated the interaction between threads can get. Obviously, using Windows 2000 synchronization objects to manage shared data structures like the work unit queues and the thread pool itself is critical to ensure the application runs correctly on a multiprocessor.
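The structure described above can be modeled in miniature. The following sketch uses Python's standard threading primitives rather than the Win32 API, and collapses the Sender, Receiver, and Control roles into a simple enqueue loop; the worker function and sentinel-based shutdown are illustrative simplifications, not a real server design:

```python
# Minimal thread pooling sketch: workers sleep on a shared work queue and
# wake as units of work arrive. A lock protects the shared results list.
import queue
import threading

work_queue = queue.Queue()        # the application's work item queue
results = []
results_lock = threading.Lock()   # synchronizes access to shared state

def worker():
    # Each worker thread blocks on the queue until work (or a sentinel) arrives.
    while True:
        item = work_queue.get()
        if item is None:          # sentinel value: shut this worker down
            work_queue.task_done()
            break
        with results_lock:
            results.append(item * 2)   # stand-in for real request processing
        work_queue.task_done()

def run_pool(items, num_workers=4):
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:            # the "Receiver" side: enqueue incoming work
        work_queue.put(item)
    work_queue.join()             # wait until every work item is processed
    for _ in threads:             # post one sentinel per worker to stop them
        work_queue.put(None)
    for t in threads:
        t.join()
    return sorted(results)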

Given this application architecture, how many worker threads should be created? To find the answer, view the application as an M/M/n queuing system. The optimal number of processing threads, n, depends on the rate at which work arrives, the average amount of time a worker thread spends processing a request, and the capacity of the machine to perform the work involved. (In this chapter, we are concerned only with CPU resources, but memory, disk, and networking resources all need to be considered.) Each worker thread available represents only an opportunity to receive service because the resource needed (the processor, the disk, the database) may itself be overloaded. But it is safe to assume that if the number of items waiting in the work queue is greater than one per processor, and the processors are less than, say, 60% utilized, increasing the number of worker threads will not do any harm and may actually do some good.

If increasing the number of available threads results in increased processor utilization (a good enough measure of application throughput for now), then you are on the right track. It may seem counterintuitive, but you can then continue to increase the number of threads until the processor saturates at close to 100% busy. Of course, at this point you have worker threads waiting in the Ready Queue for CPU service, but this is merely changing where the work requests are queued: from the application work queue to the processor Ready Queue. From a pure performance standpoint, it does not make much difference where application work units are queued. What may happen, though, is that other workloads on the same system suffer because the processors are overloaded. That consideration may force you to throttle back the thread pooling application so it does not overwhelm all other work on the system.

When a unit of work arrives requesting some service and no worker threads are available, the thread pooling application creates a new thread to process it. Just in case the entire thread pool gets backed up waiting for some resource (and because the overhead of thread creation is not completely negligible), server applications usually establish some upper limit on the total number of worker threads that can be created. Of course, if an existing thread is available, it is used instead. Normally, threads created during a period of heavy demand are retained for some time afterwards, even though they are idle. It is more efficient to keep a few idle worker threads alive than for your application to be constantly creating and destroying threads.

To summarize, multithreaded Windows 2000 server applications that operate a thread pool dynamically create threads as needed, up to some user-defined upper limit. As a general rule, if the application is currently running at or near its maximum allotment of threads, the work queue is backed up, and processor resources are available, then it is time to increase the limit on creating new threads. This rule of thumb is far from foolproof, unfortunately. If worker threads are consistently blocked at the disk, for example, adding more processing threads is not likely to provide any performance relief. Once the work item queue is depleted, the thread pooling application is likely to respond to further work requests with an error condition, which is also something that can be avoided by increasing the size of the work item pool.

Monitoring Thread Pooling Applications

An application that relies on thread pooling must provide feedback on its performance so the administrator can set a reasonable thread upper limit. The thread pooling application should monitor the rate at which work requests arrive, the current size of the request queues, the current number of threads in use, and the number of available threads. Of course, direct measurement of request wait time and execution time is extremely useful, too. Instrumentation added to the application should report these performance metrics, ideally using a Perflib DLL, so that the application measurements can be correlated with other aspects of system activity.

With applications that do not report response time metrics, you can compute average response time using Little's Law if you know the transaction arrival rate and the number of transactions in the system (namely, the number of busy work queue items, including requests waiting in the work queue). As discussed in Chapter 1, Little's Law postulates:

Queue Length = Arrival Rate × Response Time

Under the right circumstances, we can apply Little's Law to calculate response when measures of the arrival rate and the number of concurrent work requests in the system are available. The Microsoft Exchange Message Transport component, for example, provides performance counters that measure the rate of incoming and outgoing messages, the length of the work queue, and the number of concurrent threads in use. It does not report response time, but you can estimate it using Little's Law.
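Rearranged to solve for response time, the formula is easy to apply programmatically. A minimal sketch, assuming you already have the two measured inputs from an application's counters:

```python
# Little's Law: Queue Length = Arrival Rate x Response Time, rearranged to
# estimate response time from counters an application does report.
def little_response_time(requests_in_system, arrival_rate_per_sec):
    """Average response time in seconds. Valid only for a system in
    equilibrium, with arrivals roughly matching completions."""
    if arrival_rate_per_sec <= 0:
        raise ValueError("arrival rate must be positive")
    return requests_in_system / arrival_rate_per_sec

# For example, 12 concurrent work requests (queued plus in service) at an
# arrival rate of 4 requests per second implies a 3-second response time.
rt = little_response_time(12, 4.0)
```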

With these basic reporting requirements in mind, let's look at some other Windows 2000 server applications that utilize thread pooling and see how they stack up. Table 5-3 summarizes the thread pooling measurement statistics that IIS Active Server Pages and Windows 2000 networked file services supply. These two thread pooling applications are extensively instrumented, as discussed in detail in the next sections.

Table 5-3: ASP and file server thread pooling application performance counters

Measurement statistic     Active Server Pages counters    Server and Server Work Queues counters
Active worker threads     Requests Executing              Active Threads, per Server Work Queue
Maximum worker threads    (none)                          Active Threads + Available Threads, per Server Work Queue
Work items in use         Requests Executing              Max(Available Work Items) - Available Work Items, per Server Work Queue
Work queue depleted       Requests Rejected               Work Item Shortages, Blocking Requests Rejected
Work request rate         Requests/sec                    Context Blocks Queued/sec
Active requests           Requests Executing              (none)
Queued requests           Requests Queued                 Queue Length, per Server Work Queue

IIS Active Server Pages

IIS has two tuning parameters that control the size of its thread pool: PoolThreadLimit, which defaults to two threads for each MB of physical memory, and MaxPoolThreads, which controls the number of threads IIS allows to run concurrently. The default is 10 active threads per processor, which applies to all active IIS connections, including both HTTP and FTP connection requests. These tuning parameters are located at HKLM\SYSTEM\CurrentControlSet\Services\InetInfo\Parameters. Figuring out how to set these IIS threading parameters is difficult, though, because IIS does not report statistics on activity within the thread pool. Only data on the number of concurrent connections for the various IIS services is available from the System Monitor measurements.

IIS with Active Server Pages is a little different. Active Server Pages are scripted, which means they represent code that is executed by IIS at runtime to create the web page to display dynamically, rather than referencing static HTML pages (files) that are simply transmitted across the wire. ASP is frequently used when it is necessary to access a SQL Server or Oracle database containing information to be displayed. Not surprisingly, ASP runtime services are configured as a thread pool, as are the ODBC connections to SQL Server or some other database. There are two Registry settings for IIS Active Server Pages that impact the performance of ASP on a multiprocessor: ProcessorThreadMax and RequestQueueMax. These settings are added to the Registry at the following location: HKLM\SYSTEM\CurrentControlSet\Services\W3SVC\ASP\Parameters. Using the Windows 2000 System Monitor, you should monitor the number of ASP Requests/sec, the number of Requests Executing, and the number of Requests Queued. ASP also reports measures called Request Wait Time and Request Execution Time, which reflect the amount of time the last request processed spent waiting in the queue and executing.

It is useful to compare the Request Execution Time and Wait Time values to the average response time of requests calculated using the Little's Law formula:

Requests Executing + Requests Queued = Requests/sec × Average Response Time

This calculation is illustrated in an example shown in Figures 5-22 through 5-24, which report on measurement data collected from a busy e-commerce web site running IIS and ASP. There are several caveats about the validity of using Little's Law to calculate ASP request response time from the available measurement data, however. Little's Law requires very few assumptions, as discussed in Chapter 1. But one assumption it does require is that the system being measured is in a state of equilibrium, where the number of requests entering service is equal to the number of requests exiting the system. When this assumption cannot be satisfied because the system is blocked, the Little's Law calculation is meaningless. Another important assumption is that the ASP request rate itself must be sufficiently high to justify this statistical approach. Little's Law can normally be applied successfully when several transactions per second are processed, but not when only a few transactions per hour are processed. Finally, keep in mind that the Requests/sec counter is a continuously measured value reflecting all activity during the interval, while the number of Requests Executing and Requests Queued are instantaneous values measured at the end of the current interval. With measurement intervals longer than a minute or so, it is easy for these two types of measurements to be out of sync. With these warnings in mind, however, you should find it extremely useful to calculate ASP average response time, instead of relying solely on the direct service and queue time measures that the System Monitor reports, which apply only to the last ASP request processed in the interval.
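Applied per measurement interval, the calculation might be sketched as follows. The counter values here are hypothetical, not taken from the figures, and the minimum-rate guard reflects the caveat that the estimate is untrustworthy when traffic is too light:

```python
# Estimate ASP response time per interval from exported counter data:
# (Requests Executing + Requests Queued) / Requests/sec.
def asp_response_times(executing, queued, requests_per_sec, min_rate=1.0):
    """Return a response time estimate (seconds) per interval, or None for
    intervals whose request rate is too low for the statistics to hold."""
    times = []
    for ex, q, rate in zip(executing, queued, requests_per_sec):
        if rate < min_rate:
            times.append(None)   # too little traffic to trust Little's Law
        else:
            times.append((ex + q) / rate)
    return times

# Requests Executing, Requests Queued, and Requests/sec over three intervals.
# The second interval models a backlog spike like the one at 15:00.
estimates = asp_response_times([4, 6, 5], [0, 120, 10], [4.0, 5.0, 0.5])
```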

Figure 5-22 shows the ASP request rate (on the right-hand axis) and the values of the Requests Executing and Requests Queued counters (shown as a stacked bar chart graphed against the left-hand axis). In this example, the ASP request execution rate peaks at about five requests per second, which serves as an upper limit of the capacity of this system to field web requests. Requests arriving at a higher rate lead to a large number of queued requests, as illustrated. Notice the large quantity of queued requests at several times during the day, with the largest spike at 15:00 when the ASP queue exceeds 120 work requests.

Figure 5-22. Requests/sec is not keeping pace with the arrival rate of ASP requests

Figure 5-23 shows the measured Request Execution Times and Wait Times reported over the same interval. Remember that both the Request Execution Time and Request Wait Time reported represent processing time for only the last request processed. Nevertheless, we can see that some of the times reported are quite large, particularly during intervals that correspond to large spikes in the number of ASP requests queued for processing in Figure 5-22.

Figure 5-23. The measured Request Execution Times and Wait Times reported

Finally, Figure 5-24 shows the results of making the Little's Law calculation to derive a measure of response time for this IIS ASP application. The calculated average response time is usually below five seconds, except for several spikes that correspond to periods where there was a large backlog of requests in the ASP processing queue. These calculated measures are reasonably consistent with the actual measurements for the last ASP request processed shown in Figure 5-23. Most of those requests were processed in less than five seconds, too. It is always worthwhile to compare a direct measure of response time against a value calculated using Little's Law.

Figure 5-24. The ASP Response Time estimated here applies Little's Law

Since ASP requests issuing database calls can block inside your back-end Oracle or SQL Server database, tuning a large-scale ASP environment is not as simple as increasing the ASP request processing level. It may also entail understanding and tuning your database configuration. We discuss the performance of IIS and Microsoft's Active Server Pages technology in more detail in Chapter 12.

File Server

The Windows 2000 Server service, which provides file services to network clients, is a venerable thread pooling application that dates back to Windows NT's heritage in OS/2 LAN Manager. Server tuning options still reside in the Registry key HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters. The File Server service consists of several threads running within services.exe. (To see which services.exe threads are associated with Server, try stopping the Server service while you are monitoring the services.exe process threads.) Network clients accessing files and directories that reside on a file server generate work requests. (Server is installed on NT Workstation and Windows 2000 Professional, too, both of which can serve quite nicely as peer-to-peer networked file servers.)

Network file sharing remains a mainstay application in Windows 2000, and the application has elaborate performance and tuning options that are worthy of its stature. It also produces lots of performance counters so that you can monitor it effectively. Knowing that it is a thread pooling application is extremely helpful in understanding the various performance options and performance data that the Server application provides. Here's how the application works.

As file requests are received, a free work item is allocated, which is handed off to a worker thread to process. Since many networked file requests involve doing I/O to disk (many others are cached), threads performing this function can block. While a request is blocked waiting for I/O to complete, the Server worker thread parks that request on the Blocking Queue and looks around for more work to do.

Meanwhile, nonblocking work requests that are waiting are assigned to separate work queues per processor. This symmetric multiprocessor support spreads the file server load evenly across all available processors. This is accomplished using a file server ISR and a DPC that queues incoming file server requests on the work queue associated with the processor that accepted and processed the network request. Server creates separate work queues for each processor available on a multiprocessor to avoid having to use spin locks to protect access to a single shared work queue. With worker threads and a work queue constructed for each processor, there is no need to protect these dedicated work queues from worker threads executing concurrently on a different processor. Within a processor, since only one server thread can be executing at a time, there is no need for spin locks either. This allows the file server services to scale extremely well as more processors are added. If the work item queue from one processor is depleted, available work items can then be borrowed from other processor work queues.[4]
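The per-processor queue layout described above can be illustrated with a toy model. This sketch is a deliberately simplified, single-threaded illustration of the data structure, not the actual Server implementation; in the real service, the borrowing path still has to synchronize carefully, which is exactly why it is the exception rather than the rule:

```python
# Toy model of per-processor work queues with borrowing: each processor's
# workers consume a dedicated queue, falling back to a neighbor's queue
# only when their own is empty.
from collections import deque

class PerProcessorQueues:
    def __init__(self, num_processors):
        self.queues = [deque() for _ in range(num_processors)]

    def enqueue(self, processor, item):
        # The DPC queues the request on the processor that handled the
        # network interrupt, so the common path needs no cross-CPU lock.
        self.queues[processor].append(item)

    def dequeue(self, processor):
        # Prefer the local queue; borrow from another queue only if empty.
        if self.queues[processor]:
            return self.queues[processor].popleft()
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```

The design choice is worth noting: partitioning the queue per processor trades a little load imbalance (handled by borrowing) for the elimination of a hot, shared spin lock, which is what lets the file server scale as processors are added.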

This information about how File Server works should clarify the performance counters that are produced. The Server object contains global counters, while the server work queues contain separate instances for each processor work queue (numbered 0, 1, 2, etc.) and a Blocking Queue instance. Some of the key performance measures were identified in Table 5-3.

File Server tuning options include Registry settings for MaxThreadsperQueue, MaxRawWorkItems, and others, and the important Size parameter, which is used to specify the amount of internal memory allocated. The Server Size default is Large. If file requests are received but Server has no available Work Items, the request is rejected. By monitoring the Work Item Shortages and Blocking Requests Rejected counters in the Server performance object, you can detect any instances where File Server dropped requests due to resource shortages. You should then determine where Work Items are being queued by looking at the Server Work Queues object. Statistics for each processor queue are reported as separate instances, as illustrated in Figure 5-25. This is done to help with scalability and to avoid having to use a spin lock to control access to a shared work queue.

Figure 5-25. Monitoring File Server performance

The number of both available Server threads and active Server threads is reported per queue. These are instantaneous counters. All the items on the Blocking Queue represent active Work Items, so Server zeroes out the queue length statistics for that instance of the Server Work Queue object. One thread per processor is assigned to the Blocking Queue to service I/O requests as soon as possible. If file requests are received and all available threads are busy, Server starts additional threads up to the value of MaxThreadsperQueue. The default value is 10 threads per queue in Windows 2000, and 30 threads per queue in NT 4.0. As noted in Chapter 3, there is also an option to adjust File Server thread priority.

File Server does not report response time or service time metrics, but, fortunately, you can calculate the average response time using Little's Law. Context Blocks Queued/sec represents the arrival rate of File Server requests. The number of requests in the system equals MaxRawWorkItems (which the System Monitor does not report directly) minus Available Work Items (which is available in the System Monitor). In other words, each active Work Item represents an active request. Unless you specifically set MaxRawWorkItems, the number created can vary significantly across configurations, as illustrated in Table 5-4. But even though MaxRawWorkItems is not directly available, you can observe the maximum value for Available Work Items, especially during a period of no file server activity. In Figure 5-25, for example, the maximum Available Work Items reported was 59.

Table 5-4: Maximum Raw Work Items for different OSes with different amounts of memory

OS                    # of engines   MB    Max Raw Work Items
NTW 4 SP3             1              128   2
NT Server 4 SP4       1              128   127
NT Server 4 SP3       2              256   80
Windows 2000 Server   1              256   127

With this information you can compute the average Server response time:

RT = (MaxRawWorkItems - Available Work Items) ÷ Context Blocks Queued/sec

For example, from the sample data illustrated in Figure 5-25, the estimated response time values are calculated and displayed in Figure 5-26. We can see that the average File Server request here is processed in about 20 milliseconds. Note that the Available Work Items counter is an instantaneous value, while Context Blocks Queued/sec is an interval delta value. If the load on the file server is relatively steady and the measurement interval is quite small (we used 5 seconds in this example), the resulting calculation is trustworthy. Over intervals with much more variation in the number of Available Work Items, the calculated result is dubious because the necessary equilibrium assumption that Little's Law requires is violated. As always, be careful.
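The arithmetic itself is a one-liner. In this sketch the input values are illustrative, chosen to reproduce a result of roughly 20 milliseconds like the one in the text, not read from Figure 5-25:

```python
# File Server response time estimate via Little's Law:
# RT = (MaxRawWorkItems - Available Work Items) / Context Blocks Queued/sec.
def file_server_response_time(max_raw_work_items, available_work_items,
                              context_blocks_queued_per_sec):
    """Average response time in seconds; work items in use stand in for
    the number of requests currently in the system."""
    in_use = max_raw_work_items - available_work_items
    return in_use / context_blocks_queued_per_sec

# 59 work items observed at idle, 55 currently available, 200 requests/sec:
# 4 work items in use / 200 per sec = 0.02 seconds (20 ms).
rt_seconds = file_server_response_time(59, 55, 200.0)
```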

Figure 5-26. Using Little's Law to calculate the average response time for File Server requests

Transaction Server

With so many server applications utilizing thread pooling to achieve scalability, Microsoft initiated a development project to build a runtime environment suitable for most types of transaction-oriented client/server applications. Microsoft developers recognized that if they could prepackage a thread pooling interface into an appropriate set of runtime services, it might greatly accelerate the development of client/server application software for the Windows 2000 platform. This impulse ultimately led to the development of MTS, the Microsoft Transaction Server, now bundled into both Windows NT and Windows 2000. MTS runtime services include automatic thread pooling for both applications and ODBC database connections, plus a variety of shared (private) memory communication services that server applications typically employ. MTS also provides transaction back-out and recovery services.

Essentially, MTS offers built-in services to provide the basic building blocks that complex client/server applications normally require. With the availability of MTS, which has been absorbed into the COM+ development effort in Windows 2000, the number of thread pooling server-side applications available to run under Windows 2000 is expected to increase rapidly. Assuming that appropriate measurement data is available for MTS transaction processing applications, a tuning methodology similar to thread pool monitoring can be used to fine-tune performance of these applications over a wide range of hardware environments.

Coding practices

With MTS and COM+ runtime services making it easier than ever to develop scalable thread pooling server applications for Windows 2000, it becomes necessary to say a few words about the performance and scalability of such applications. It is not surprising that Windows 2000 Server applications projected to run on large-scale n-way multiprocessors often do not scale well. System administrators all too frequently find that, even though ample processing resources (as well as memory, disk I/O, and networking bandwidth) are available, there are evident response time or throughput bottlenecks. Finding out where and why these applications bottleneck, and then figuring out how to fix them, is not trivial.

One frequent source of performance problems in multithreaded server applications is poorly designed critical sections. If multiple threads are forced to wait excessively for critical section locks, the application will not scale well to large n-way multiprocessing hardware. The first problem is finding out whether this is a factor in the poor performance of your application. The second problem is figuring out how to fix it.

Diagnosing locking performance problems associated with large-scale, multithreaded applications can be difficult. For device driver code, Microsoft supplied the Thunk function within the x86 Perf Meter application that is also used to set and view Pentium counters. The Thunk function turns on instrumentation inside Windows NT device driver services, including kernel spin lock functions, as illustrated back in Figure 5-9. At the device driver level, you can use Thunk to select the specific NT kernel service that you want to observe. Excessive spin lock cycle consumption is also liable to drive overall processor utilization up and effective instruction execution rates down.

For multithreaded server applications that use Win32 synchronization services like critical sections and mutexes, monitor the application's threads while it is running using the System Monitor. (Win32 synchronization services avoid using spin locks on multiprocessors.) You will probably need to run the System Monitor at very short intervals, perhaps collecting data once per second. Examine the Thread State and the Thread Wait State Reason. Threads waiting for a critical section or a mutex have a Wait State Reason of 7, Waiting for a Component of the NT Executive. Unfortunately, that is a very common Thread Wait State Reason. By weighing how little processor time threads consume, based on the Thread % Processor Time counter values, against how frequently you find them in Wait State Reason 7, you may be able to conclude that lock contention is a serious performance concern.
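As a sketch of how you might reduce such samples to a single figure, the fragment below tallies hypothetical one-second System Monitor samples for one worker thread. It assumes the NT thread state encoding in which state 5 is Waiting and state 2 is Running; the sample data itself is invented for illustration.

```python
from collections import Counter

# Hypothetical one-second samples of (Thread State, Wait State Reason)
# for a single worker thread. State 5 = Waiting, state 2 = Running.
samples = [(5, 7), (5, 7), (2, 0), (5, 7), (5, 7), (5, 7)]

# Count wait reasons only for samples where the thread was waiting.
waits = Counter(reason for state, reason in samples if state == 5)

# Fraction of all samples spent waiting with reason 7 (waiting on a
# component of the NT Executive, e.g., a critical section or mutex).
frac_reason7 = waits[7] / len(samples)
print(f"fraction of samples waiting with reason 7: {frac_reason7:.0%}")  # 83%
```

A thread that shows up in Wait State Reason 7 in most samples while accumulating little % Processor Time is a candidate for the lock contention analysis discussed next.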

Profiling tools like Visual Quantify and VTune, discussed in the previous chapter, can also be used to pinpoint runtime locking bottlenecks. Visual Quantify works well for multithreaded single-process applications. You may have to rely on VTune for applications that span process boundaries. VTune can also find performance problems in device driver code. One overt symptom of lock contention problems is that processor utilization is relatively low and processor resources are readily available, yet increasing the number of threads does not boost application throughput. If your multithreaded application is instrumented and exports performance monitoring counters as described in Table 5-3, you will also be able to recognize that work queues are backing up while this is all happening.

Finding these performance problems is tough enough, but fixing them may be even tougher. Conceptually, threads queue up waiting for locks due to the rate of locking requests and the duration of those requests. Thread serialization and locking is a classic queuing problem. Optimization solutions along the lines of reducing the rate and duration of locking requests immediately suggest themselves. If requests for acquiring some pooled resource are queued, you may be able to reduce the rate of requests by allowing bulk requests.[5] Instead of acquiring these resources one at a time, you may be able to acquire them two, five, or even one hundred at a time. If making bulk requests cuts down on the rate of lock requests without making the duration of locking requests significantly longer, the overall impact on the performance of your application is beneficial.
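The ID broker from footnote 5 is a concrete instance of the bulk-request idea. The sketch below is hypothetical, but it shows the essential trade: each broker takes the global lock once per batch rather than once per ID, cutting the lock request rate by the batch size.

```python
import itertools
import threading

class IdBroker:
    """Hands out unique serial IDs, refilling from a shared counter in bulk.

    Hypothetical sketch of the bulk-request technique: instead of entering
    the global critical section once per ID, each broker enters it once per
    `batch` IDs, reducing the lock request rate by a factor of `batch`.
    """
    _global_lock = threading.Lock()
    _global_counter = itertools.count()

    def __init__(self, batch=100):
        self.batch = batch
        self._pool = []

    def next_id(self):
        if not self._pool:
            # One trip through the critical section yields `batch` IDs.
            with IdBroker._global_lock:
                self._pool = [next(IdBroker._global_counter)
                              for _ in range(self.batch)]
        return self._pool.pop(0)

broker = IdBroker(batch=100)
ids = [broker.next_id() for _ in range(5)]
print(ids)  # [0, 1, 2, 3, 4]
```

Multiple brokers drawing on the same counter never hand out duplicate IDs, because each refill is serialized under the global lock; in effect there are now multiple short queues instead of one long one.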

With an eye towards reducing the duration of locking requests, you will want to examine the code inside critical sections very carefully, fine-tuning it to ensure that it can execute as quickly as possible. Breaking monolithic locking structures into finer-grained ones can solve many locking-related performance problems. Look for opportunities to replace a single lock that is heavily used by three, four, or more separate locks. In extreme cases, you may even need to do away with shared-memory data structures that are subject to heavy contention and replace them with separate per-processor data structures that do not require cross-processor locking, as Microsoft did with File Server.
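Lock splitting can be sketched in a few lines. The example below is hypothetical: it replaces one table-wide lock with several finer-grained ones by hashing keys across independent stripes, so threads touching different stripes never contend with one another.

```python
import threading

class StripedCounterTable:
    """Replaces a single table-wide lock with several finer-grained ones.

    Hypothetical sketch of lock splitting: keys are hashed across
    `stripes` independent locks, so only threads that land on the same
    stripe ever contend for the same lock.
    """
    def __init__(self, stripes=4):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._tables = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def increment(self, key):
        i = self._stripe(key)
        with self._locks[i]:  # roughly 1/stripes of the old contention
            self._tables[i][key] = self._tables[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._tables[i].get(key, 0)

t = StripedCounterTable(stripes=4)
for _ in range(3):
    t.increment("requests")
print(t.get("requests"))  # 3
```

Per-processor data structures, as in the File Server example, are the limiting case of this design: one stripe per processor, with no cross-processor locking at all.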

Partitioning Multiprocessors

Beyond controlling multithreaded applications, another important tuning option for system administrators is partitioning large-scale multiprocessors to handle heterogeneous workloads. As we have seen, Windows 2000 shared-memory multiprocessors face serious scalability hurdles. As more processing engines are added to the system, the efficiency of the overall system diminishes due to the overhead of maintaining cache coherence and contention for the shared-memory bus. As we suggested at the outset, it makes sense to run Windows 2000 systems as dedicated application servers for a number of reasons, including some of the scalability and performance issues discussed in this chapter. But most shops would prefer not to have to maintain quite so many machines. For ease of administration, it is certainly advantageous to consolidate Active Directory and Domain controllers with machines performing file and print services so that each new redirected filesystem request does not require authentication from an entirely different machine on the network. Running messaging applications like MS Exchange or Lotus Notes on a few large-scale machines may result in reduced administration costs and less complex replication processing to manage. Partitioning offers a way to carve up large-scale multiprocessors into two or more logical machines that can be managed more efficiently than an undifferentiated n-way server machine.

Partitioning is accomplished in Windows 2000 Server by assigning specific processes to specific processors using the Win32 hard processor affinity API. We have seen that the Windows 2000 Thread Scheduler implements soft affinity by default. If the processor where the thread was last dispatched is free or currently running a thread with a lower priority, the thread in question is dispatched on that processor, known as its ideal processor. The reason for selecting the thread's ideal processor is performance. When the thread was last executed there, it began to accumulate its code and data in the processor cache. Once the thread has established its working set in processor cache, the instruction execution rate (IER) can be expected to be several times faster. The IER for a thread subject to a cache cold start is potentially only 10-20% of the IER expected when the same thread benefits from a warm start in cache. (The specific benefit of a cache warm start is, of course, workload-dependent, but the potential speed-up factor is considerable.) Because many program segments have very tight working sets, programs usually do not need lots of processor cache, either, once they get started. By dispatching a thread on the processor on which it ran last, Windows 2000 is automatically attempting to optimize for processor performance.

Now imagine a situation where you deliberately restrict an important application so that it can only run on a small number of processors using hard affinity. This is a particularly attractive option when you are configuring a large four-, six-, or eight-way multiprocessor. Restricting the threads of a process to a small number of processors increases the likelihood that when the threads are dispatched, they are dispatched on a processor that retains cached data from that process address space. In an extreme (but by no means unusual) case, you might consider restricting a process to run on only one of the available processors. In this case, the process's threads are always dispatched on the same processor, so it is highly likely that at least some portion of the process working set is still resident in cache when its threads resume execution. Remember that concentrating threads on a processor tends to improve the instruction execution rate. Consequently, it would not be unusual for a process that consumed the equivalent of 80% of one processor, spread across all eight available engines, to need only 40% of one processor once its threads are all concentrated there. Because the IER is boosted, the application actually runs twice as fast.[6]
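Hard affinity masks use one bit per processor: bit n set means the process may run on CPU n, which is the layout the Win32 SetProcessAffinityMask call expects. The sketch below only computes mask values; it does not set them, and the CPU choices are illustrative.

```python
def affinity_mask(cpus):
    """Build a hard processor affinity mask: bit n set => CPU n allowed.

    Same bit layout as the mask passed to the Win32 SetProcessAffinityMask
    API; this sketch just computes the value.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# Restrict a process to CPUs 0 and 1 of an eight-way:
print(hex(affinity_mask([0, 1])))  # 0x3
# Concentrate it on CPU 3 alone:
print(hex(affinity_mask([3])))     # 0x8
```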

Improving the IER of specific process threads on a multiprocessor is the rationale behind partitioning. Partitioning is used to counteract some of the multiprocessor scalability effects described earlier. It is an important technique for improving the cost-effectiveness of large n-way shared-memory multiprocessors.

NDIS ProcessorAffinityMask

Concentrating device driver interrupt handling code onto one specific processor is particularly attractive because performance issues associated with spin locks disappear. Device drivers must use spin locks to guard shared-memory data structures that can be accessed from separate threads running on different processors at the same time. We saw that a thread spinning in a busy wait queue waiting for another thread to finish processing inside a critical section may execute many instructions, but it is not actually doing any useful work. The innovative use of queued spin locks in Windows 2000 reduces shared memory bus contention during spin lock execution, but does not actually reduce the amount of time an application might wait for a spin lock to become free.

Now consider what happens when a device driver interrupt handler DPC is always dispatched on the same processor. There will never be any contention for the spin locks that protect its shared-memory data structures. Because only one driver thread at a time can be executing, only one thread at a time will ever try to enter a critical section. The code that loops testing the lock word over and over again until it is free is never executed; since no other thread is in the critical section, the lock is acquired without delay!
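A minimal test-and-set sketch makes the point concrete. This is a conceptual model only (a real spin lock relies on an atomic hardware test-and-set instruction, not Python statements): when every acquirer runs on the same processor, the lock word is always found free and the spin loop body never executes.

```python
class SpinLock:
    """Conceptual test-and-set spin lock (not atomic; illustration only)."""
    def __init__(self):
        self._flag = False
        self.spins = 0  # instrumentation: wasted busy-wait iterations

    def acquire(self):
        while self._flag:   # busy wait: never entered if uncontended
            self.spins += 1
        self._flag = True

    def release(self):
        self._flag = False

lock = SpinLock()
lock.acquire()  # uncontended, as when all DPCs share one processor
lock.release()
print(lock.spins)  # 0 -- no busy-wait cycles were burned
```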

For this reason, the Windows 2000 NDIS (Network Device Interface Specification) driver that controls your network interface cards (NICs) supports a hard processor affinity setting. At HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters, you can code a ProcessorAffinityMask value, which is used to associate specific NIC cards with specific processors. When the NDIS parameter is nonzero, the deferred procedure calls (DPCs) that process NIC card interrupts are confined to a specific processor, one per card. You can code a simple 32-bit processor affinity mask of x'FFFFFFFF' (the system default value) and let the operating system set the processor affinity of each NIC card driver, or set a specific mask yourself. Under the default processor affinity mask, the system assigns the first NIC card to the highest-numbered CPU in the system (CPU 7 on an eight-way, for example), then works its way down until all NIC cards have been assigned different processors.
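The default assignment rule just described can be sketched in a few lines. This is a model of the documented behavior, not the driver's actual code: the first NIC goes to the highest-numbered CPU, the next NIC to the next CPU down, until each card has its own processor.

```python
def default_ndis_assignment(n_cpus, n_nics):
    """Model of the default NDIS ProcessorAffinityMask behavior:
    NIC 0 -> highest-numbered CPU, NIC 1 -> next CPU down, and so on,
    one processor per card."""
    return {nic: n_cpus - 1 - nic for nic in range(min(n_nics, n_cpus))}

# Two NIC cards on an eight-way:
print(default_ndis_assignment(8, 2))  # {0: 7, 1: 6}
```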

The NDIS processor affinity mask was introduced in NT 3.5 when the operating system confined interrupt processing to CPU 0. The processor affinity mask allowed you to force DPC processing to other processors, lessening the load on CPU 0, which was the likely bottleneck. The NDIS driver processor affinity option that confines DPCs that process network interrupts can still be a useful setting on a Windows 2000 Server running symmetric multiprocessing where interrupts can be processed on any available processor. Suppose the Windows 2000 Server is serving as a high speed networking hub. Because DPCs are dispatched at a higher priority than any normal task, setting processor affinity virtually ensures that the DPC code stays resident in the processor cache, assuming there are frequent enough interrupts. This makes the NDIS DPC code execute very fast!

However, there is a drawback when DPCs are processed on a different processor from the one that processed the original interrupt. For example, assume that the NDIS DPCs have a hard affinity setting assigning them to CPU 3 in a four-way SMP. Whenever an NDIS interrupt occurs that is processed on a processor other than CPU 3 (say CPU 1), then CPU 1 must signal CPU 3 when it schedules the DPC. Interprocessor signaling, as we indicated previously, is an expensive operation. In addition, any data associated with the NDIS interrupt processing retained in CPU 1 cache must be flushed.

From the standpoint of processor instruction execution efficiency, better results are obtained when both the NDIS interrupts and the associated DPCs are dispatched on the same engine. This can be done, but requires a third-party package, for instance, the NCR SMP Utilization Manager.[7] Using the NCR SMP Utilization Manager, you can assign processor affinity to specific interrupts, as illustrated in Figure 5-27. In this example, the IRQs associated with the Intel 8042 10/100 Ethernet card are assigned hard processor affinity for CPU 1, the same processor assigned to NDIS DPCs, by default. Using the NCR package, we also configured interrupts associated with the Adaptec SCSI card on this NT 4.0 server to CPU 0 to balance the interrupt processing workload a bit. This configuration option works best when a server with multiple PCI buses is used, so that SCSI interrupts assigned to one of the buses can be processed concurrently with NDIS interrupts assigned to a different PCI bus.

Figure 5-27. NCR SMP Utilization Manager assigns processor affinity to interrupt servicing

Setting hard processor affinity in this manner runs counter to the spirit and intent of symmetric multiprocessing. The resulting configuration becomes asymmetric, which means that the workload is no longer balanced across processors. Figure 5-28 illustrates the imbalance that occurs on this machine when NIC interrupts are concentrated on CPU 1. % Interrupt Time is substantially higher on CPU 1 (the topmost line) because NIC interrupts are processed there exclusively. Since our goal was to concentrate the NIC interrupts and DPCs by isolating them to CPU 1, Figure 5-28 indicates we have achieved the desired result. The obvious danger of setting hard affinity is that confining any process to too few processors restricts its performance. This is not very risky in the case of binding DPC execution to a single processor because % DPC Time is normally quite low, and NIC interrupts only occur one at a time.

Figure 5-28. Hard processor affinity for Interrupts and DPCs leads to an unbalanced configuration

However, this is a risky strategy for any workload that might overrun the processors assigned to serve it. If there are not enough processors assigned (or the processors are not big enough), the workload will back up inside the system. Problems arise when you start to find critical application threads delayed in the Ready Queue even though overall processor utilization is low. The shape of the problems should become clear when you analyze individual processor-level statistics. You may find that the processors where constrained workloads are confined are quite busy, while others are relatively idle. Creating a severely imbalanced configuration is one of the serious risks of manually partitioning SMPs, and avoiding this requires a commitment to regular performance monitoring to ensure that resources dedicated to specific critical workloads remain adequate to the task. The next section discusses this issue in greater depth.

Guidelines for Setting Processor Affinity

In general, utilizing hard processor affinity to partition workloads on the system calls for:

- Understanding the processor requirements of the applications you intend to partition
- Restricting other workloads that would conflict with those critical applications to different processors
- A commitment to continuous performance monitoring once the partitioning scheme is in place

Let's take a look at all three areas of concern.

Understanding the CPU requirements of any specific application entails monitoring its associated Process % Processor Time counters. You can learn about longer-term CPU usage patterns and trends by monitoring the application over several weeks at regular intervals. If processor utilization peaks at predictable hours, monitor those periods closely to understand the characteristics of those peak periods. You need to configure the system so that there are enough processor resources available to it for both normal and peak loads. Consider monitoring peak periods in even greater detail, for instance, at one-minute intervals instead of five, to gain additional insight into the application's resource requirements.

Deriving a peak:average ratio is a very useful metric for understanding the application's processing requirements, where the peak period is the busiest one- or five-minute interval over a typical processing shift, as compared to the average measured during the shift. The greater the variability in processor requirements, the higher the peak:average ratio. Workloads with a peak:average ratio of over 2 or 3 are very inconsistent and need to be watched extra carefully.
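The ratio itself is straightforward to compute from a shift's worth of utilization samples. The sketch below uses invented five-minute % Processor Time samples; the peak is taken as the busiest window of samples, here a single interval.

```python
def peak_to_average(samples, peak_window=1):
    """Peak:average ratio over a shift of utilization samples.

    `samples` are per-interval % Processor Time values; the peak is the
    busiest `peak_window`-sample stretch. Sample data is hypothetical.
    """
    avg = sum(samples) / len(samples)
    peak = max(
        sum(samples[i:i + peak_window]) / peak_window
        for i in range(len(samples) - peak_window + 1)
    )
    return peak / avg

shift = [20, 25, 30, 90, 85, 30, 25, 20]  # hypothetical shift samples
print(round(peak_to_average(shift), 2))   # a ratio above 2: watch closely
```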

You also need to factor in some additional system overhead requirements beyond what is specifically consumed at the process level. No doubt, the application in question will require some additional system services performed on its behalf. Server applications like SQL Server or Oracle, which attempt to bypass most Windows 2000 system services once they are initialized, usually require another 10-20% of additional system overhead. Other applications may require 20-40% more processor resources in the form of various system memory and I/O services.

Once you understand the processor requirements for a given workload, you can establish a partitioning scheme. To ensure that the system remains responsive to the workload in question, configure enough processors so that the projected average application load, including system overhead, puts the configured CPUs in the range of 20-50% busy. The rationale behind this rule of thumb is as follows. Below 20% utilization, the application is probably not dispatched frequently enough on its dedicated processors to benefit much from cache warm starts. Above 50% utilization, there may not be sufficient processor resources to handle peak loads (assuming a 2:1 peak:average ratio) without excessive queuing. Of course, you should also monitor the system object Processor Queue Length counter and periodically check to ensure that threads from the application you are concerned with are not delayed waiting in the Ready Queue (Thread State = 1).

To ensure that lengthy processing queues do not develop during peak loads, configure enough processors to keep projected peak load usage below 75% busy. If it is only possible to hit one of these two targets because of extremely high variability in the workload's processor requirements, then the peak load target is normally the more important one. You do not want users to experience serious response time problems during normal peak periods.
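One reading of the two rules of thumb above can be sketched as a sizing calculation. The thresholds here (a 50% ceiling on average load, implied by the 2:1 peak:average rationale, and a 75% ceiling on peak load) are the guideline figures from the text, and the example loads are hypothetical.

```python
import math

def processors_needed(avg_cpus, peak_cpus):
    """Sizing sketch for the partitioning rules of thumb.

    Loads are in CPU-equivalents (1.6 means the work would keep 1.6
    processors 100% busy). Configure enough engines that average load
    stays at or below ~50% busy and peak load stays below 75% busy.
    """
    by_avg = math.ceil(avg_cpus / 0.50)    # average-load target
    by_peak = math.ceil(peak_cpus / 0.75)  # peak-load target
    return max(by_avg, by_peak)

# An application averaging 1.2 CPU-equivalents with a 2:1 peak:average ratio:
print(processors_needed(1.2, 2.4))  # 4
```

As the text notes, when the two targets conflict because the workload is very bursty, the peak-load target is normally the one to honor.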

Once you establish the processor requirements for your applications, you usually also need to restrict other workloads that might conflict with your critical applications to processors other than the ones assigned to your preferred workload. To make life easier, narrow the field to only those workloads with significant processing requirements and assign them hard processor affinity on different processors.

Finally, using hard processor affinity settings to partition a large n-way Windows 2000 server requires a commitment to monitor that system continuously to ensure that changes in the workload do not render your carefully planned configuration inadequate. You must remain vigilant. If the workload changes in such a way that critical applications are being starved of processing resources they need, then you must adjust hard processor affinity settings accordingly. Once you set up hard processor affinity restrictions, the Windows 2000 Thread Scheduler can no longer balance processing across all available CPUs. One indication that a system with hardcoded processor affinity is out of balance is that the dispatcher Ready Queue gets longer even though average processor utilization remains low. On the other hand, you may find, by examining the threads waiting in the ready state, that this behavior is exactly what you intended. If threads from preferred applications are being executed promptly while threads associated with lower-priority applications are delayed, that is precisely the behavior you want.

SQL Server

Among Windows 2000 packaged applications, only SQL Server supports a processor affinity mask. Up through Version 6.5, you set it using either the Configuration tab of the Configure dialog in the Enterprise Administrator application, or the sp_configure stored procedure. Note that the Show Advanced Options flag must be turned on to see these options. The processor affinity mask option is designed to complement the other advanced processor tuning parameters available in SQL Server. These include the ability to boost SQL Server threads to real-time priority above even OS kernel threads, to define the duration of a SQL Server thread time-slice, and to control the level of SQL Server thread concurrency.

On a multiprocessor, setting priority boost lets SQL Server threads run at priority 24 in the real-time range, higher priority than any OS kernel threads. When priority boost is coded along with a processor affinity mask, Microsoft documentation refers to this as SQL Server's dedicated SMP support because it allows SQL Server threads to run at higher priority than anything else in the system except ISRs, DPCs, and APCs. This is worth trying on an n-way multiprocessor only if you also set a processor affinity mask to enable the operating system and threads from other applications to get access to other processors not dedicated to SQL Server, as illustrated in Figure 5-29.

Figure 5-29. SQL Server 2000 can boost SQL Server thread priority into the real-time range

The SMP concurrency setting under SQL Server 6.5 controls the number of threads that SQL Server will release to Windows 2000 for execution. See Figure 5-30. By default, the value of the SMP concurrency parameter is set to 0, which means SQL Server releases n - 1 threads, where n is the number of processors detected. (On a uniprocessor, SQL Server is set to release only one thread at a time anyway.) If a processor affinity mask is set, SMP concurrency is set by default to -1, which means to release one less than the number of configured dedicated processors. There are many situations where you might need to restrict SQL Server running on an n-way multiprocessor even further. Again, analyze SQL Server's CPU requirements and set a processor affinity mask and an SMP concurrency value accordingly.
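The default behavior just described reduces to a small rule, sketched below. This is a model of the documented SQL Server 6.5 defaults, not vendor code, and it assumes at least one thread is always released.

```python
def released_threads(smp_concurrency, n_processors, dedicated=None):
    """Model of SQL Server 6.5 SMP concurrency defaults: 0 releases n-1
    threads for n detected processors; -1 releases one less than the number
    of processors in the affinity mask; a positive value is used as-is."""
    if smp_concurrency == 0:
        return max(n_processors - 1, 1)
    if smp_concurrency == -1:
        return max((dedicated or n_processors) - 1, 1)
    return smp_concurrency

print(released_threads(0, 4))                # 3: default on a four-way
print(released_threads(-1, 8, dedicated=4))  # 3: affinity mask with 4 CPUs
```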

Together, the processor affinity mask, the priority boost, and SMP concurrency parameters can dramatically alter the performance of SQL Server running on a multiprocessor. None of these parameters should be set without first analyzing SQL Server's CPU processing requirements, as described previously in the "Guidelines for Setting Processor Affinity" section. A number of Windows 2000 professionals report having great success using these tuning options, especially in benchmark environments where the workload is very controlled. These parameters are one reason that SQL Server can run TPC-C benchmarks so well. When your Windows 2000 server is dedicated to running SQL Server, these advanced tuning options can be used in concert with the NDIS processor affinity mask. On the other hand, when you are trying to run SQL Server as one of several server applications sharing a large n-way multiprocessor, using these options can easily do a great deal of harm. In the rough-and-tumble world of real-life workloads, it is necessary to proceed much more conservatively because of the potential impact on real customers and their service levels.

Figure 5-30. SQL Server Version 6.5 advanced options include processor affinity mask, the priority boost, and SMP concurrency parameters

NCR SMP Utilization Manager

To make any headway at all using partitioning on a large n-way Windows NT server outside of the realm of dedicated SQL Server machines, you probably need NCR's SMP Utilization Manager (SUM). This utility was originally designed to accompany NCR's high-end Worldmark server hardware, which can be configured with multiple Intel processors. Initially, NCR found it difficult to run its largest server configurations efficiently under NT Version 3. It developed the SUM utility program to add administrator controls to the NT Thread Scheduler. The program remains just as useful in an NT 4.0 symmetric multiprocessing environment in case you want to consolidate several workloads and you configure three, four, or more engines. (A good deal of the collective NCR wisdom on running large-scale NT servers is condensed in Curt Aubley's Tuning and Sizing Windows 2000 for Maximum Performance, from Prentice Hall Computer Books, 2001.) Currently, the SMP Utilization Manager is available from NCR in a bundle called Enterprise Pack, which includes some other tools designed to help administer very large-scale Windows 2000 systems. Meanwhile, the SUM program itself has evolved into an effective tool for partitioning any large n-way NT Server. (As we were going to press, the NCR SUM program was still not available for Windows 2000.)

As illustrated previously in Figure 5-27, the SUM utility can be used to assign hard processor affinity to different IRQs and their associated ISRs. This can be used, as illustrated, to direct interrupts from SCSI cards and NICs to separate processors, or, along with the NDIS ProcessorAffinityMask, to assign both interrupt processing and DPCs associated with specific NIC cards to a specific processor. As a more general-purpose partitioning tool, the program can be used to set both the priority and processor affinity of any service or application process. Figure 5-31 illustrates the capability of the program to create a static configuration file to detect specific processes at initialization and assign them designated priority or processor affinity settings. As of this writing, the program is unique in providing this valuable capability to assist NT system administrators in configuring and controlling a large multiprocessor NT server.

Figure 5-31. The NCR SMP Utilization Manager Detect Processes tab


1. Neil J. Gunther, The Practical Performance Analyst. New York: McGraw-Hill, 1998.

2. For instance, see the review article on eight-way scalability in the September 1998 Windows NT Magazine at http://winntmag.com/Magazine/Article.cfm?IssueID=58&ArticleID=3781.

3. Several examples in this chapter illustrate the use of the older Pentium counter tool Microsoft provided in the NT 4.0 Resource Kit. This utility is no longer available in the Windows 2000 Resource Kit--in fact, the 4.0 version blue-screens on a Windows 2000 machine. The Pentium counter tool in the NT 4.0 Reskit did not always work as designed anyway. There were many environments where you could not succeed in getting the P5ctrs.dll Perflib DLL to report on Pentium counters through Perfmon. On a multiprocessor with multiple engines, the P5stat.sys device driver was supposed to enable P6 measurements on multiple processors, reporting multiple instances for the various counters. Using the x86 Perf Meter program, P5perf.exe, however, you are only able to look at one processor at a time, and you cannot tell which one. As a Resource Kit tool sold "as is" with no support, Microsoft was under no obligation to fix these deficiencies. When Microsoft dropped support for its Pentium counters tool in Windows 2000, it was fortunate that Version 2 of CPUmon that interfaces to the performance monitoring API was available from another supplier.

4. Note that borrowing a work item on occasion from another processor's queue complicates the per-processor work queue implementation. The borrowing thread must gain access to the work queue using a spin lock, of course. This necessitates building a critical section that protects the processor queue for all accesses that change its status. But since borrowing is a rare occurrence, normally only one thread at a time is ever executing the critical section.

5. Update transactions delayed in a queue waiting to receive a unique serial identification number is the classic example. One well-known solution is to create intermediate brokers that acquire groups of IDs in bulk. Instead of transactions queuing for a single component that assigns IDs serially, transactions request a unique ID from one of the brokers. Having multiple brokers, in effect, establishes multiple queues, which removes the bottleneck.

6. Cache effects like this are notorious for creating havoc in workload characterization and benchmarking, as exemplified in trying to obtain repeatable performance benchmark results. Because of cache effects, increasing the rate at which work arrives often leads to processing efficiencies where the amount of processor time consumed per transaction actually decreases. Consider a workload transaction that runs in isolation. It is subject to a cache cold start and takes duration N to process. A second transaction that can share the cached data associated with the first transaction benefits from a cache warm start, and only takes a fraction of the time N to process. Throughput is doubled while average service time actually decreases. Due to cache, processing efficiency may continue to increase as the arrival rate of transactions increases. Contrasting cold versus warm start cache performance in processors is hardly the only place where we see this counterintuitive phenomenon.

7. As of this writing, the NCR SMP Utilization Manager is not available to run under Windows 2000.
