Building Linux Clusters

By David HM Spector
1st Edition July 2000
1-56592-625-0, Order Number: 6250
352 pages, $44.95, Includes CD-ROM

Chapter 2
Basic Concepts

This chapter deals with some of the fundamental issues behind clustered computing, including answering the basic question: Why Clusters? We will also cover some basic concepts involved in parallel computing, program optimization, and some basic network technology. This background will enable us to work from a common vocabulary when we move on to the actual design, installation, and configuration of clusters.

If you are familiar with clusters, parallel programming, and networking terminology, you might wish to skip ahead to Chapter 3, Designing Clusters.

Why Clusters?

Why bother with the hassle of designing and building clusters when there are perfectly good commercial supercomputers available on the market? The short answer is: money. Do you have several million dollars you'd like to tie up?

Clustered systems were first explored by Don Becker and his colleagues at NASA because budgetary restrictions precluded their access to the kind of commercial supercomputer they needed to perform complex analysis of the very large space data sets that NASA missions tend to generate. They found a way to get the computational performance they needed without committing millions of dollars they didn't have. They named their creation "Beowulf" after the hero of the Old English epic poem.

Clusters are surprisingly powerful. There is a semi-annual listing put together by the University of Mannheim in Germany that describes the top 500 supercomputers in the world[1]. Until 1997, almost all of the systems listed were commercial supercomputer systems from well-known manufacturers such as Cray, Silicon Graphics, and IBM. In 1998, something extraordinary started to appear on this list: Linux-based parallel clusters. Two of the systems near the top of the list--number 97, called "CPlant," developed by Sandia National Labs, and number 113, called "Avalon," developed by Los Alamos National Labs--are Linux-based Beowulf clusters.

The supercomputer has also come to play a larger role in business applications. In areas from data mining to fault tolerant performance, clustering technology has become increasingly important.

Commercial products have their place, and there are perfectly good reasons to buy a commercially produced supercomputer if it is within your budget and your applications can keep the machine busy all the time. But you will also need a data center to keep it in, and a budget for the maintenance and upgrades required to keep your investment up to par. The reality is that many who need to harness supercomputing power don't buy supercomputers because they can't afford them.

And then there is the upgrade problem.

By the time your machine is delivered, it is often out-of-date and a newer model will have taken its place. If you would like to upgrade your machine, the manufacturer gives you limited options, short of buying "next year's model."

Clusters, on the other hand, are a cheap and easy way to take off-the-shelf components and combine them into a single supercomputer. In some areas of research clusters are actually faster than a commercial supercomputer. Clusters also have the distinct advantage in that they are very simple to build using components available from hundreds, if not thousands, of sources.

You don't even have to use new equipment to build a cluster. One of the most interesting stories in the Internet community of cluster developers is that of the "stone supercomputer," which was built by a group at Oak Ridge National Labs. Oak Ridge built a Beowulf cluster, and rather than spend any money on the hardware at all, they solicited donations from other groups in their organization. So, for no investment other than their own time, they built a several-dozen node system out of discarded 486 machines, slow Pentiums, and other systems. It may not have been pretty, or have had the latest, greatest processor, but it got the job done.

Once you understand the basics, clusters are simple to build. Unlike commercial products, they are independent of a single vendor or single source for equipment. The most economical aspect of parallel clusters is that they can be built from commodity hardware.

"Commodity hardware" includes two distinct kinds of computers. First, there is commercial, off-the-shelf systems from any well-known desktop or server PC manufacturer such as Dell, Compaq, or Micron. If you want to build a cluster for your business and you have little experience building things from kits, I would suggest that you buy systems for your cluster from a manufacturer you already know or with whom you have a relationship.

The second meaning of "commodity hardware" is that you can, literally, buy hardware as a commodity. This means buying motherboards, cases, memory, disk drives, and so on in bulk, and then using these components to build the individual systems that will become the elements of your cluster.

Building systems from scratch is not for everyone, but if you take the plunge, you can build a completely customized cluster that is tuned specifically for your applications and can be upgraded and enhanced more cost effectively than a cluster made from off-the-shelf commercial systems.

If you are willing to put it all together yourself, you can save 50-70 percent of the cost of an off-the-shelf system, but this requires a more in-depth understanding of all of the various subsystems that make up high-end systems, and a willingness to roll up your sleeves and build computers from the system board up.

Whichever way you choose to start your cluster, either with commercial systems or with ones you build yourself, you can always upgrade your cluster to increase its performance. You can do it either as a whole, or a node at a time, by simply adding more systems or upgrading components.

Clustering Concepts

Clusters are in fact quite simple. They're a bunch of computers tied together with a network working on some large problem that has been broken down into smaller pieces. There are a number of different strategies you can use to tie them together. There are also a number of different software packages that can be used to make the software side of things work. But first, let's define the major concepts and themes that you will encounter as you explore and build these systems.

The Big Picture: Parallelism

The name of the game in high-performance computing is parallelism.[2] Parallelism is the quality that allows something to be done in parts that work independently rather than a task that has so many interlocking dependencies that it cannot be further broken down. Parallelism operates at two levels: hardware parallelism and software (algorithmic) parallelism.

Hardware Parallelism

On one level, hardware parallelism deals with the CPU of an individual system and how you can squeeze performance out of sub-components of the CPU (including caches, pipelines, and multiple execution units) that can speed up your code. At another level, there is the parallelism that is gained by having multiple systems working on a computational problem in a distributed fashion. These forms are known as "fine grained" for parallelism inside the CPU or among multiple CPUs in the same system, and "coarse grained" for parallelism across a collection of separate systems acting in concert.

CPU-level parallelism

A computer's Central Processing Unit (CPU) is commonly pictured as a device that operates on (or executes) one instruction after another in a straight line, always completing one step or instruction before a new one is started, as shown in Figure 2-1. For example, a single instruction might load a value from memory location 12345 and store it in a CPU register.

This serial execution continues until the CPU is forced for some reason to follow the code as it branches someplace else (as in a subroutine call) to start executing another set of instructions.

In older machines, like the DEC VAX, the Intel 8088, and the Motorola 68000, this is exactly what happens; the flow of instructions is very deterministic and predictable. At any given moment, there is exactly one instruction being executed by the CPU.

Figure 2-1. Instruction stream on a CISC machine


Newer CPU architectures, such as the Sun UltraSPARC, the DEC/Compaq Alpha, and even the Pentium II/III, have an inherent ability to do more than one thing at once. The logic on the CPU chip divides the CPU into multiple execution units.

An execution unit is the part of a computer's circuitry that actually does the work we associate with the "compute" in the word "computer." It moves data from memory into the internal storage of the processor (called the registers) and then performs the operations that the particular instruction calls for (ADD, SUBTRACT, MULTIPLY, etc.). Then it sets flags to indicate the result of the instruction, moves results from the registers back to memory, and so on.

Systems that have multiple execution units allow the CPU to attempt to process more than one instruction at a time. I use the word "attempt" deliberately because part of the process of executing multiple instructions includes keeping track of the instructions that are being executed, the results of the operations in terms of registers that may have been modified, condition codes that are set, and branch instructions that are to be called. If, for some reason, the execution of an instruction would cause the CPU to have to stop what it's doing and branch to some other place in the program, it's quite possible that all of the other instructions being worked on in other parts of the CPU would be invalidated and discarded.

Although this sounds like a waste of work, it rarely is, because most computer programs are very linear entities. Well-written programs (or, more to the point, well-optimized code from a high-quality compiler) will arrange binary executable code so that it spends as little time branching as possible. This allows the CPU to work more efficiently; preprocessing instructions that might never be executed is still more efficient than leaving CPU resources idle.

Two additional hardware features of modern CPUs support multiple execution units: the instruction pipeline and the cache. The pipeline is a small area of memory inside the CPU where instructions that are next in line to be executed are stored. The pipeline gives the CPU a chance to get ready to execute the instructions. This is done by analyzing (or decoding) the instructions and getting (or fetching) any additional information the CPU will need to actually operate on the data for these instructions.

Instruction pipelines help keep work flowing to a CPU and, for optimized code, will keep the CPU busy as long as the code doesn't have to jump too far away to execute something that isn't stored in the cache.

The cache is a much larger area of memory, often several megabytes, where large portions of programs are stored in anticipation of eventual execution by the CPU. Cache memory is orders of magnitude faster in terms of access time than the system's main memory, so getting instructions out of the cache and putting them in the instruction pipeline is only slightly slower than accessing data that is actually inside the CPU in registers.

Taken together, the cache, the pipeline, and the CPU itself act like a coordinated assembly line for instruction execution.

Such parallelism allows impressive increases in CPU performance. It also requires a lot of intelligence on the part of the compiler to arrange the executable code in such a way that the CPU has a good chance of being able to execute multiple instructions simultaneously.

It's not important that you understand all about the theory behind this level of CPU parallelism, but for the purpose of building and operating a cluster, you should understand that well-optimized code, along with an efficient CPU, can help speed up your applications.

System-level parallelism

At a higher level, system-level parallelism is the crux of what clustered computing is all about. It is the parallelism of multiple nodes coordinating to work on a problem in parallel that gives the cluster its power.

There are other levels at which even more parallelism can be introduced into this system. For example, if you decide that each node in your cluster will be a multi-CPU system (called a symmetric multiprocessor) you will be introducing a fundamental degree of parallel processing at the node level. Having more than one network interface on each node introduces communication channels that may be used in parallel to communicate with other nodes in your cluster. Finally, if you use multiple disk drive controllers in each node, you create parallel data paths that can be used to increase the performance of the I/O subsystem.

Each decision you make will have an effect on the hardware level of parallelism on your cluster. As you design your cluster, you will make various decisions about how many nodes you will configure the system with, how many CPUs per node, and how many network connections and disk drives or controllers will be connected to each node.

All of these areas of hardware parallelism can be used to great advantage in the solution of your problems on a cluster, but only if your problem is amenable to them.

Memory systems

Memory is, next to the CPU, where all the action takes place in a cluster or any computer. Having too little of it means that the computer's operating system will take up most of the space, leaving very little room for your program, causing your program to endlessly thrash around trying to find enough resources to run. Having too much of it means...well, you can never have too much memory in a computer.

There are two kinds of memory we are concerned about: cache memory and main memory.

Cache memory

Cache memory is a specialized kind of memory that is almost an extension of the register set of the CPU. Cache memory is designed to be very fast (access times are on the order of nanoseconds). Chunks of a running program are stored in the cache so the CPU can access them immediately without waiting for access to the main system memory, which can be several orders of magnitude slower than the cache. As pieces of the program (and data) are retrieved for eventual execution by the CPU, more code and data are pulled in from main memory or from disk to replace them.

The major role of the cache is to help keep the CPU's instruction pipeline full; again, this is to keep the CPU from having to go to the slower main memory system to fetch more instructions to execute.

Main memory

This is where your programs are stored after they are read in from some semi-permanent form of storage such as hard disk or tape. The memory system of most machines is governed by three major factors: the memory access speed, the number of memory wait states, and the system bus speed.

Even though the CPU is the focus of the computer's computational activity, there are a lot of other things going on. Not surprisingly, each of them, ultimately, has an effect on the performance of your system and, by extension, your cluster.

The memory access speed is the amount of time it takes for a memory chip to recover after a read or a write. Even though action can seem instantaneous on a computer, there is a lot of "waiting" going on. In a memory chip, when something is read from or written to memory, time must pass before that location can be written to or read from again. This is because the way most common forms of computer memory are designed, it takes time both for a memory cell to be refreshed with a new value once it has been written to, and for the memory cell to be refreshed if it has been read from. Depending upon the density of the chips, and a number of other factors, this speed can be as fast as forty nanoseconds, or as slow as one hundred nanoseconds (100ns). One hundred nanoseconds seems like a short amount of time, but if the CPU has to wait that 100ns for a frequently accessed memory location, it can really slow things down. The time required to refresh a memory location is also known as the refresh rate.

Wait states are delays forced upon a CPU in order to better synchronize it with a memory system. Systemboard designers put delays into the hardware to ensure that the CPU won't falsely sense a hardware failure when it attempts to get a value from memory and that memory location is still in the middle of a refresh cycle. Wait states are a necessity in many systems because of the great disparity in speed between the CPU and main memory; this is also why cache memory exists, to mediate between the world of slow main memory and the much faster world of the CPU.

The system bus speed is a measure of how fast information can be moved between the CPU and peripherals, such as a display controller or a disk controller. Most high-end x86-style machines support a 66MHz bus speed; newer models support 100MHz. The difference is in how much time the CPU spends waiting for its peripherals to complete commands given to them, and how fast those peripherals can process information sent to them over the system bus. The faster the bus speed, the busier the CPU can be when it's accessing peripheral devices. A peripheral that can process requests faster than the bus can deliver them will find itself at the mercy of a system bus that is slower than its own ability to process information.

Some systems use specialized memory systems to allow any CPU in the system to have fast unhindered access to any range of memory. These are called crossbar switches.

A crossbar switch is a device that is designed like a grid of streets whose traffic flow is controlled by traffic lights, as shown in Figure 2-2. The role of a crossbar switch is to arbitrate access to the resources connected to it. In the example of a fast computer system, crossbar switches are used to provide uniform access between CPUs and memory systems where none of the CPUs have to explicitly manage the memory systems.

Figure 2-2. Crossbar switch


In short, a crossbar switch acts like a traffic cop in a large city whose job is to ensure that a given driver can get his car from point A to point B without encountering any red lights or stop signs. A crossbar switch performs this function for CPUs and memory. It allows a CPU to access memory systems without being concerned that there may be other CPUs nearby that are also accessing other areas of memory. All of this happens "behind the back" of the CPU, which simply accesses the memory area it is interested in as though it were the only CPU in the system.

Not having any of the CPUs manage memory access from the other CPUs is a big win. Such memory management is very complex to program and generates a lot of overhead in the operating system. That's the "good news."

The "bad news" is that crossbar switches are expensive to design and build. Also, there's a practical limit to how large you can make a crossbar switch before the internal logic of the switch itself becomes a source of communications overhead between the CPUs and memory systems connected to it.

Input and output

A final aspect of hardware parallelism is input and output, or simply "I/O." Unlike a single-processor system used by a single person, parallel clusters need their data input and output to be as fast as possible; otherwise, the CPU can get tied up waiting for data to be read or written.

On a single-CPU system, there is usually only one process accessing a given disk at any given time. Even if there are multiple programs accessing a device, the wait is usually short enough that the performance impact on any one program is minimal.

On a parallel machine such as a Linux cluster, it is important to balance input and output across as many channels as are available so that data is handled efficiently and all of the nodes on the cluster work as close to peak efficiency as possible.

This balancing act also applies to the networking components in the cluster as well. Unless you have the budget to buy expensive Redundant Arrays of Inexpensive Disks (also known as "RAID" mass storage devices) and very fancy disk controller cards, it is probable that the data that your cluster processes might come from some other host on your network, or that the result of your clustered applications will have to be delivered to another host on your network. A well-balanced system will have enough network interfaces to allow all of the nodes on the cluster to send and receive data in an effective and efficient way that will not create a bottleneck for the rest of the processing elements of the node or for the cluster as a whole.

I've talked a bit about the CPU and how parallelism in the CPU can speed up processing, and hinted that collections of systems are what will make a Linux cluster work. But what else is there to parallel processing on such a system? A CPU without a supporting infrastructure is pretty useless, so I'll talk a bit about the peripherals and software that allow everything to work together.

Software Parallelism

This brings us to the really hard part of problem solving with a cluster (or, in fact, a commercial supercomputer): software parallelism. Software parallelism is the ability to find well-defined areas in a problem you want to solve that can be broken down into self-contained parts. These parts are the program elements that can be distributed and give you the big speedup that you want to get out of a high-performance computing system.

Before you can run a program on a parallel cluster, you have to ensure that the problems you're trying to solve are amenable to being done in a parallel fashion. Unfortunately, parallel computing is not as simple as just writing a program and saying "make." In fact, some classes of programs just cannot be run on a parallel computer.

Many classes of problems can be analyzed to find the inherent parallelism within them. Some of the more obvious problem domains include weather modeling, stock and options modeling, rendering of computer generated imagery, and various kinds of simulations.

I say these are "obvious" because they are the kinds of problems that contain "sub-problems" within them. For example, weather modeling involves solving problems relating to the heating and cooling of the atmosphere, the absorbing of sunlight by clouds and cities, as well as the movement of pressure fronts.

Financial modeling is a complicated system that, like weather modeling, requires a large number of sub-problems to be solved in order to gain an understanding of a larger picture of some element of finance. There are currency fluctuations, the availability of basic materials, and even the "value" of public sentiment.

Almost any problem that is composed of smaller sub-problems that can be quantified can be broken down into smaller problems and run on a node (or nodes) on a cluster.

The key to finding parallelism is finding the dependencies in a problem and isolating them. For example, in a program that processes data in some sort of loop, it makes little sense to read that data, wherever it is coming from, inside the body of the loop. A simple example of this can be seen in the following code fragment:

int i, j;
for (i = 1; i <= 10000; i++) {
      read_data(someDevice, &j);
      process_data(j);
}
This fragment of code processes data, one piece at a time, reading then processing it until it has completed 10,000 iterations.

Even though this example is only a few lines long, it is ripe for optimization. If all the data were already read into the system, the process_data() routine could be rewritten to process all of the data at once. As presented in this example, it is executed 10,000 times. While this might seem like a trivial example, the overhead involved in executing this loop and calling the enclosed subroutines is actually quite substantial. Registers have to be saved, data is copied, operated on, and copied somewhere else, and so on. Potentially, at each call of the read_data() and process_data() routines, the CPU also has to pull in code from main memory, as opposed to operating on data and instructions in the CPU cache. The major point here is that finding parallelism and places for optimization in most software is quite easy if you look at your application carefully and look for ways to optimize operations.

There are a lot of software tools and paradigms at the disposal of programmers looking to increase the performance of their applications. We will examine software profiling tools later when we get to Chapter 8, Programming in a Parallel Environment. As a basic concept of clustered computing, you should walk away with the idea that well-defined problems and well-written code are where the performance benefits of a cluster (or a traditional supercomputer) come from.

Underlying concepts

In the previous example, we looked at a small code fragment that, with appropriate refinement, could be made more efficient and even allow some of its work to be done in parallel. Before we try to optimize that fragment to take advantage of the parallelism we know is there, there are some underlying concepts that will help you to take advantage of it when you find it.

Granularity is the amount of work that can be done at a given scale of computation between times that some level of synchronization must occur.

For example, in a pipelined CPU, while an instruction is being executed, another can be in the process of being decoded while a third is being fetched. On an SMP machine, you can divide a loop among processors, syncing up only at boundary points in the loop. Or, on a cluster of workstations, an image processing application can split the work of rendering a frame among sixteen workstations, each of which works on its own local piece of the frame, and the pieces are recombined at the end of the process.

Dependencies are the points where one piece of code depends upon the results of some other action. Dependencies have two forms: data dependencies and control dependencies.

Data dependencies
A data dependency exists where some operation cannot proceed until data becomes available as a result of some other operation. For example, in the code fragment:

i = b + 2 * sqrt(a);
j = 24 * i;

the computation involving j cannot continue unless i is available. This is a fairly obvious dependency, and isn't an issue for non-parallel code. But for a parallel application, this kind of dependency creates a need to localize bits of code like this so they can be executed serially, not executed in parallel, which would result in incorrect results (or at least cause one parallel task to have to wait for the other).

Control dependencies
Control dependencies are even more common. These are dependencies that relate to one thread of control being tied to another. For example, the code fragment:

if (needToLoop == 1) {
    for (i = 1; i <= maxVal; i++) {
        /* do something... */
    }
}

has a direct dependency between the test in the "if" statement, and whether or not the loop is ever executed.

There are also problems that are not as easily made parallel. Some kinds of database queries, and problems that depend heavily on real-time user input, are hard to run on clusters. For problems that can be parallelized, much of the work can be handled by a parallelizing compiler, and you can help the compiler out by crafting your code so that there are as few data and control dependencies as possible.

Optimization Versus Parallelism

Many people merge the concepts of parallelism and optimization and mistakenly think that they are interchangeable. So, what is optimization?

Optimization is the art of writing or generating (if from a compiler) code that wastes as few instructions as possible and that branches as little as possible. The longer a program can run without stopping to go "someplace else" (which entails saving variables, registers, etc.) or wasting time executing instructions that are not part of whatever problem is being solved, the faster that code will be.

Compilers optimize your code by using all sorts of analysis to find tricks that reduce the complexity of the machine code generated as the output of the compiler. This reduction can be as simple as taking a common subroutine that is called 10,000 times and inserting its code inline at each call site rather than branching to it. For small, common subroutines, this is often more efficient than the overhead involved in calling a subroutine.

Another compiler trick is to arrange the machine code so that commonly called subroutines are stored near each other when a program gets loaded. This speeds up programs by allowing a subroutine call to branch to something that is often already in memory, and not to something that has to be read in from a slower system such as a hard disk drive.

Optimizing compilers can also try to reduce mathematical expressions to efficient forms that keep operands in CPU registers rather than load them from memory.

This is just the tip of the iceberg in optimization. There are several dozen wonderful books and countless PhD theses written every year on this topic. For most applications, it's not important that you be an expert in these concepts, but simply that you know of their existence.

Parallelism involves finding the sub-units of work in a given problem and executing a program on hardware that can support farming out work and joining results. Optimization plays an equally important role in single processors as it does in parallel processors--the goal in both cases is to make programs more effective by making them execute more efficiently.

In the context of parallel processing, optimization and parallelism are two sides of the same coin: finding the sub-units of work in a piece of code, and then making those bits of code as efficient as possible.

Multiprocessor Software Concepts

Clusters are fundamentally all about multiprocessing of various flavors, either inside the same system or as collections of systems. There are a number of different concepts relating to how processing is done on such a system that you should be aware of.


Multiprocessing is the idea that the control program of a computer, what we usually call the "operating system," can support (load/unload/schedule) multiple, independent, simultaneous threads of control. If the operating system can run each thread of control, or process, in an independent fashion, we can describe that operating system as supporting multiprocessing.

You'll notice in the definition above that I give special emphasis to the word independent. In order to support multiple processes, an operating system must give the user program the illusion that it's running on its own computer, complete with its own registers, stack, data space, and so on. Additionally, this space must be free from interference from other programs running on the same system. Operating systems that do not enforce process spaces that are protected from one another are not truly multiprocessing systems because any badly behaved program can crash the whole system.

Threaded programming

Once you have an operating system that can support independent processes, the next logical step is to allow something called user space parallelism. User space parallelism is more commonly referred to as threading. The operating system provides the abstraction of a process that gives users their own view of an entire computer. Threading gives yet another layer of process abstraction, but this time solely in the user's space. Threading is a way of writing programs that have a controlling portion and subordinate, or child, threads of control that share resources.

Threads are often called "lightweight processes" because they usually don't have all of the protections that a regular operating system process has, such as complete memory partitioning from other threads running in the same process. They are also usually forced to live within the bounds of the runtime scheduled for their parent process. Child threads belonging to a process also live in the stack space of their parent. This is efficient because a whole system-level process doesn't have to be created, and the stack is memory already allocated to the process, so creating threads is not expensive in terms of memory allocation.

Additionally, since threads are not truly standalone processes, they must cooperate with each other in order to get their work done. This means that threads call special routines to yield control back to their controlling process, and that they must take out locks (called semaphores) to ensure access to shared resources such as variables.
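As an illustration of threads taking out locks around a shared resource, here is a minimal sketch using Python's threading module (standing in for whatever threads library your system provides):

```python
import threading

counter = 0                  # a shared resource
lock = threading.Lock()      # the lock ("semaphore") guarding it

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:           # take out the lock before touching shared state
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # the parent waits for its child threads

print(counter)               # 40000 -- no updates were lost
```

Without the lock, two threads could read the same old value of `counter` and each write back the same incremented value, silently losing work; the lock serializes access to the shared variable.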


Synchronization is the process of putting disparate processes back in step. Consider a weather simulation in which 1,000 processes each compute a different aspect of a local weather phenomenon. At some point, the results of these calculations must be brought together to form a larger result; synchronization is what takes place when those results are merged.
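A miniature version of that merge step: worker threads each compute a partial result, the parent synchronizes by waiting for all of them, and only then combines the pieces. Sketched in Python:

```python
import threading

def partial_sum(data, results, i):
    results[i] = sum(data)   # each worker computes one piece of the answer

data = list(range(100))
chunks = [data[0:25], data[25:50], data[50:75], data[75:100]]
results = [0] * 4
threads = [threading.Thread(target=partial_sum, args=(c, results, i))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # the synchronization point: wait for every worker

total = sum(results)         # merge the partial results into the larger one
print(total)                 # 4950, the same as sum(range(100))
```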

Now that you've gotten some of the operating system and parallel processing jargon and concepts down, we can move on to the really interesting stuff: how all of these concepts are used to make high-performance parallel processors.

Parallel Processing Schemes

Parallel processing has been an active area of research for many years. There have been a number of different approaches to creating effective parallel computers, and all of them (including parallel Linux clusters) have different levels of effectiveness for different kinds of problems. Some of the best known methods are:

SMP machines
NUMA machines
UMA machines
SIMD machines
MIMD machines

Some of these names may be familiar if you have studied computer science. They represent the most common names given to different kinds of parallel computing systems.[3] Some of these architectures are subsets of other kinds; which ones are subsets will become clear as we examine each one.

SMP machines

Symmetric Multi-Processor machines have more than one CPU, and each CPU has access to the memory system and all of the attached devices and peripherals on the machine. These systems usually have a "master" CPU at boot time, and then the operating system starts up the second (and higher) CPU(s) and manages access to the resources that are shared between all of the processors.

SMP machines allow programmers to write both multiprocessing and multithreaded applications.

When they first appeared in the 1980s, commercially available SMP machines (as opposed to experimental machines built in government labs or at universities) were very expensive, primarily because all of the other components needed to support multiple CPUs were themselves costly. For most of the 1980s, memory, disk drives, and other supporting hardware were not yet commodity items. With the steep price drops and commoditization of computer systems, SMP machines have become very popular and affordable.

Most SMPs available in the consumer marketplace can accommodate two CPUs in a system, although at the time of this writing some four- and eight-CPU systems are starting to show up.

For commercial use, manufacturers like Sun Microsystems sell SMP systems that can have dozens of processors. As you add more processors to SMP systems and add hardware such as crossbar switches to make the memory access orthogonal, these parallel processors become something called Massively Parallel Processors, or "MPPs."

The following machines are special varieties of SMP machines that have made a leap from small-scale parallel processing to large-scale parallel processing.

NUMA machines

Non-Uniform Memory Access machines are very much like UMA machines in that every processor has unrestricted access to memory; however, unlike UMA machines, not all parts of the NUMA memory system have the same performance profile. Usually this is because NUMA machines have local fast memory on a per-CPU basis, and then a large, slower shared memory system.

UMA machines

Uniform Memory Access machines are those machines where each processor has equal priority access to the main memory of the machine without arbitration from a master processor. This is usually accomplished by means of a crossbar switch.

SIMD machines

Single Instruction Multiple Data machines are unique. They are a class of machine that contains many processors (usually hundreds or thousands) where the same program, down to the instruction, is executed on each of the processors. Each processor is working on a unique piece of data. SIMD machines usually require very specialized memory and bus architecture in order to be effective. One of the most famous SIMD machines was the "Connection Machine" made by Thinking Machines Corp.

MIMD machines

Last, but not least, are Multiple Instruction Multiple Data machines. As the name implies, these machines operate with multiple instruction streams and multiple data streams. In essence, parallel computers of this type are individual machines that by some mechanism, usually some set of software libraries, cooperate to solve a computational problem.

Linux clusters are, by definition, MIMD machines.

Networking Concepts

Linux clusters are simply collections of computers, connected by networks, that take advantage of special software systems to agglomerate the power of individual machines into computing capabilities like those of traditional supercomputers. Networks allow the connected computers to communicate either individually, computer to computer, or in large groups in which one computer broadcasts a piece of information to the others. In order to build these systems, a basic understanding of how networks are constructed is necessary.

Networking technologies consist of four basic components and concepts:

Network protocols
Protocols are a set of standards that describe a common information format that computers use when communicating across a network. A network protocol is analogous to the way information is sent from one place to another via the postal system. A network protocol specifies how information must be packaged and how it is labelled in order to be delivered from one computer to another, in much the same way that a letter is put into an envelope and then addressed with a destination address and a return address for delivery by the postal service.

Network interfaces
The network interface is a hardware device that takes the information packaged by a network protocol and puts it into a format that can be transmitted over some physical medium like Ethernet, a fiber-optical cable, or even through the air using radio waves.

Transmission medium
The transmission medium is the mechanism through which the information bundled together by the networking protocol and transmitted by the network interface is delivered from one computer to another. This can be something as simple as a twisted pair of wires (as in the ubiquitous 10BaseT networking standard) or as esoteric as a wireless network system that uses radio waves to move data from place to place. For physical media such as wires and fiber-optical cables, the data received from and sent to network interfaces is in the form of on and off pulses that are either voltages on a wire, or pulses of light. For radio-based networks, the ones and zeros are translated into modulated radio waves on a given set of frequencies.

Bandwidth
Bandwidth is the amount of information that can be transmitted over a given transmission medium in a given amount of time. It is usually expressed in some form of "bits per second." Depending upon the medium being used, bandwidth can range from hundreds of bits per second, which is agonizingly slow (fortunately, a rarity these days), to trillions of bits per second. Most networks in use in the late 1990s range from 10 to 100 Mbits/second for home and office networks to many gigabits per second for long-haul commercial voice and data networks.
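The bandwidth figure translates directly into transfer time: bits to move divided by bits per second. A back-of-the-envelope helper (idealized; real transfers add protocol overhead and latency):

```python
def transfer_seconds(payload_bytes, bandwidth_bits_per_sec):
    """Idealized transfer time: ignores protocol overhead and latency."""
    return payload_bytes * 8 / bandwidth_bits_per_sec

# Moving a 100 MB data set across the two common Ethernet speeds:
size = 100 * 10**6  # bytes
print(transfer_seconds(size, 10 * 10**6))    # 10 Mbit/s  -> 80.0 seconds
print(transfer_seconds(size, 100 * 10**6))   # 100 Mbit/s ->  8.0 seconds
```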

For the sake of simplicity, we will stick to the most commonly used kind of networking system, Ethernet. Ethernet is a network technology, developed at Xerox's Palo Alto Research Center in the 1970s, that uses a very simple strategy to deliver information from one computer to another. Ethernet specifies a set of message formats (called packets) for information transmitted over the network: each packet starts out with some address information, followed by the data to be sent, and then some trailing information that can be used to determine whether the packet was damaged in transit. The most commonly available bandwidths for Ethernet networks are 10 Mbits/second and 100 Mbits/second.
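Ethernet's real frame layout is specified by the IEEE 802.3 standard; purely as an illustration of the "addresses, then data, then a trailing damage check" structure just described, here is a toy frame builder in Python (the field sizes are simplified, and the checksum here is zlib's CRC-32 rather than Ethernet's exact frame check sequence):

```python
import struct
import zlib

def make_frame(dst, src, payload):
    """Toy frame: two 6-byte addresses, a length, the payload, then a
    4-byte CRC trailer so the receiver can detect damage in transit."""
    header = dst + src + struct.pack(">H", len(payload))
    body = header + payload
    return body + struct.pack(">I", zlib.crc32(body))

def frame_ok(frame):
    body, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(body) == struct.unpack(">I", trailer)[0]

frame = make_frame(b"\xff" * 6, b"\x00\x01\x02\x03\x04\x05", b"hello")
print(frame_ok(frame))                       # True: frame arrived intact
damaged = frame[:15] + b"X" + frame[16:]     # a payload byte got mangled
print(frame_ok(damaged))                     # False: the CRC no longer matches
```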

A Simple Network

A good way to explore the networking components actually at work is to look at the simplest cluster you can make, which is simply a network with some machines connected to it, such as the four-node network shown in Figure 2-3.

Figure 2-3. A simple four-node network


Each computer that participates in the network has a network interface, usually some kind of an expansion card designed for that system that supports a particular network medium.

The computers all talk to a central object called a hub. The hub is a device that allows each of the computers on the network to communicate with one another without actually being physically connected--the hub acts as a relay point for pieces of information being sent over the network. Hubs are a convenient mechanism for creating networks because they minimize the number of cables that have to be run between systems and because they are easy to configure and install.

Another way to allow computers to talk to each other is to make point-to-point connections between the machines. These will be used in several of the network topologies described in the next section.

Lastly, the computers are connected to the hub by way of simple twisted pairs of wires that resemble telephone cords, which make these kinds of networks very easy to build and configure.

Network Configurations

There are a number of different kinds of basic network configurations, each used for a different size of network, or to achieve a different level of network performance.

Bus networks

A bus network is a network in which each node has one connection to a common, shared network resource via some piece of interconnecting hardware like an Ethernet hub, as shown in Figure 2-4. Each node generally can consume as much of the network as it wants, except when it runs up against another node attempting the same. In other words, each node on a bus network generally competes for the shared resources of the single network cable.


Switched networks

Switched networks are very similar to bus networks, except that instead of a hub that allows unmoderated communication among any and all hosts, the interconnection mechanism is a device that gives each connected device an entire network's worth of bandwidth. This device is called a network switch. A network switch isolates each connected device and sends packets directly between the hosts that wish to communicate.

Figure 2-4. Simple bus topology



Cubes

Cubes of various degrees are also very common topologies for networks that need specialized communications, and for parallel Linux clusters. The "degree" of a cube is the maximum number of hops through intervening nodes between any two endpoints. Hypercube clusters can be as simple as four interconnected nodes, as in Figure 2-5.

Figure 2-5. A two-cube


Figure 2-6 represents a three-cube. In a three-cube, there are eight processors, and any processor can talk to any other with a maximum of three hops through intervening nodes.

Figure 2-6. A three-cube


Cubes of deeper dimensions can be created by connecting lower-order cubes into higher-order cubes; in effect, putting a cube inside a cube.
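The three-hop claim for a three-cube can be checked directly: in the standard hypercube labeling (assumed here), each node gets a binary ID and neighbors differ in exactly one bit, so the minimum hop count between two nodes is the Hamming distance between their IDs. A quick Python sketch:

```python
def hypercube_hops(a, b):
    """Minimum hops between nodes a and b in a hypercube where neighbors
    differ in exactly one bit: the Hamming distance of the two IDs."""
    return bin(a ^ b).count("1")

# In a three-cube (8 nodes, IDs 0..7) no pair is more than 3 hops apart:
print(hypercube_hops(0b000, 0b111))   # 3 -- opposite corners of the cube
print(hypercube_hops(0b000, 0b001))   # 1 -- direct neighbors
print(max(hypercube_hops(a, b) for a in range(8) for b in range(8)))  # 3
```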


Meshes

Lastly, there are various forms of meshes. A mesh is a network topology where nodes are arranged on a grid, as in Figure 2-7. Grids should look familiar if you've ever taken high-school algebra; they're simply Cartesian planes with nodes along the vertical (y) axis and the horizontal (x) axis.

Figure 2-7. Meshes


A mesh can be constructed out of combinations of the two kinds of networks described previously, using hubs and switches.

The degree of connectivity between any two nodes can be described by the number of nodes between them. Interestingly, if you fold a mesh, you can change its connectivity and make what are, effectively, shortcuts or "worm-holes" that can move traffic between nodes across the mesh (and across a fold or curve) more quickly than routing the packet over the direct or Cartesian path.

Ethernet Communications

Ethernet is much like a room where a large number of people are talking: if everyone tries to talk at once, bits of conversation will be lost. In order for complete conversations to happen, people have to 1) listen before they speak, and 2) back off and try again if they verbally collide with someone else who is speaking.

Ethernet employs a similar scheme to try to make its data conversations work. When a computer wants to transmit on the network, it listens first to see if anyone else is talking. If someone else is using the network, it backs off for a few milliseconds, and then listens again to see if the coast is clear and if its information can be transmitted.
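As a toy model of that listen-then-back-off behavior (this is not real driver code; `channel_busy` is just a stand-in for the carrier-sense hardware):

```python
import random

def try_to_send(channel_busy, max_attempts=16):
    """Toy CSMA sketch: listen first; if the wire is busy, back off for a
    random interval and then listen again. channel_busy is a function
    reporting whether someone else is talking right now."""
    for attempt in range(max_attempts):
        if not channel_busy():
            return attempt          # the coast is clear on this attempt
        backoff_ms = random.uniform(0, 2 ** min(attempt, 10))
        # (a real adapter would wait roughly backoff_ms here before retrying)
    return None                     # gave up: the network is too congested

# A channel that is busy twice, then free:
states = iter([True, True, False])
attempt = try_to_send(lambda: next(states))
print(attempt)                      # 2 -- succeeded on the third listen
```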

Unfortunately, Ethernet, as a protocol spoken by network interfaces, is not as smart as human conversationalists. It doesn't guarantee that any packet put on the network will actually arrive at its intended destination, and this low-level protocol doesn't specify any way to recover from a conversational miscue.

In order to make sure that the conversations (data connections) between computers are reliable, a higher-level of communication is needed; this is where the "network protocol" comes in.

TCP/IP Networking

The networking protocol that is used in the design of Linux clusters is the Transmission Control Protocol/Internet Protocol, or TCP/IP. This is the same suite of network protocols used to operate the Internet.

TCP/IP is the lingua franca of the Internet. It is a communications protocol specification designed by a committee of people who, although they all worked for different organizations, sometimes with directly competing interests, were able to craft a set of standards whose effectiveness was proven by a set of examples of working software.[4]

It is an open standard, which means anyone can implement it and every implementation of TCP/IP can speak to every other regardless of who implemented it, as long as the TCP/IP implementor didn't deviate from the standard.

TCP/IP Addressing

An IP address consists of four numbers separated by periods. Each piece of the address represents eight bits, which allows 256 values for each of the four address components, for a total of over four billion possible network addresses.

These addresses are often called "dotted quad addresses" by old-timers. The dotted quad is a convenient, human-readable way of representing the 32-bit number that computers use when handling these addresses.
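The correspondence between the dotted quad and the underlying 32-bit number is easy to demonstrate with Python's standard library (192.168.1.1 is just an arbitrary example address):

```python
import socket
import struct

def quad_to_int(addr):
    """Dotted quad -> the 32-bit number the computer actually works with."""
    return struct.unpack(">I", socket.inet_aton(addr))[0]

def int_to_quad(n):
    """The 32-bit number -> its human-readable dotted quad form."""
    return socket.inet_ntoa(struct.pack(">I", n))

print(quad_to_int("192.168.1.1"))    # 3232235777
print(int_to_quad(3232235777))       # 192.168.1.1
```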

Where Do IP Addresses Come From?

Internet addresses are not just "made up"; they're allocated by a special group of Internet engineers who keep track of who has been assigned what numbers. This group is called the Internet Assigned Numbers Authority, or IANA. The reason that addresses are assigned to individuals, network carriers, and corporations is to ensure that all Internet addresses are unique. If Company A connects to the Internet using a given address and Company B does the same using the same IP address, neither of them is going to have a very good day. This is because Internet addresses are used to construct a kind of road map (called a routing table) of who's who, and more importantly who's where, on the Internet. This road map is used by the Internet routers to deliver packets from one place to another. If an address shows up in two different places at the same time, it's impossible to know which is the "right" one, and so information would be lost by both Company A and Company B.

The designers of IP realized that networks that used their protocol might come in all different sizes. They split up the address space into several hierarchies of networks, called Class A, Class B, and Class C. These classes represent network sizes ranging from really huge (millions of machines) all the way down to small office-sized networks of 254 machines or fewer.

Imagine organizing your mailing address in this fashion. It would look something like:


Dotted quad notation can be read in this "largest to smallest" fashion, where the first byte indicates the largest area of the network, the second byte gets closer to the final destination, and so on. In fact, each number in the quad represents an address class.

Let's look at Class C type networks first.

Class C network numbers are addresses where all but the last eight bits are pre-specified. As in Figure 2-8, if you were given a network address that started out "192.168.1," you could specify anything you wanted for the last eight bits of the address. The host byte can take any value from 1 through 254; the first value (0) and the last (255) are reserved. Class C addresses are very much like the address of an individual house on a street, such as 123 Main Street.

Figure 2-8. Class C network addressing


Class C network addresses are identifiable by looking at the first byte of the IP address. If the first number is between 192 and 223, you're looking at a Class C address. The next two bytes are also specified as part of the network number by IANA, and the fourth byte is, as noted above, available for whatever devices the user wants to address. In essence, a Class C network allows you to address a local area network of up to 254 machines.

Class B networks are the most common kind of network addresses given to very large organizations. For example, most universities and corporations have Class B network addresses. This class of network address allows you to make a network of networks, as shown in Figure 2-9. For example, you could have machines on the network numbered "172.16.1.x " and "172.16.2.x " where "x " is a workstation address.

Figure 2-9. Class B network addressing


Technically, these sub-networks of a larger network are called subnets. Looked at simply, 172.16.1 and 172.16.2 can be viewed as two different Class C networks, each of which can have up to 254 machines connected to it. Both 172.16.1 and 172.16.2 are subnets of the Class B network "172.16."

Class B addresses are like the street address of an office building, where the first two bytes specify the office building, the third byte specifies the floor, and the last byte specifies the mail-stop of a given individual.

Starting to get the picture? What a Class B network buys you is the ability to have 256 networks, each with up to 254 hosts. This comes in very handy when you are trying to network a campus, a large number of departments, or even a company with offices in a number of cities.

Class B network addresses may also be identified by looking at the first byte of the IP address. If the first number is between 128 and 191 then you're looking at a Class B type address. The second byte is also assigned by IANA.

Finally, there is the Class A network address type, pictured in Figure 2-10. A Class A address expands on the capabilities of a Class B network by adding another byte's worth of networks. So, a Class A network can have 256 networks of 256 networks, with up to 254 hosts per network. This comes out to roughly 16.6 million possible addresses available to assign to individual hosts.

Figure 2-10. Class A network addressing


A Class A network number can be identified by its first byte; if it's between 1 and 127, you're looking at a Class A address (network 127 is reserved for loopback). The remaining three bytes are assigned by the user. This address class is very much like an inverted, country-level postal address that specifies the state (the first byte), the city (the second byte), the street (the third byte), and the house address (the last byte).
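The first-byte rules for the three classes can be collected into one small function. A sketch in Python of the classful scheme described above (anything above 223 is lumped together here):

```python
def address_class(addr):
    """Classify an IP address under the old classful scheme by first byte."""
    first = int(addr.split(".")[0])
    if 1 <= first <= 127:     # 0 is reserved; network 127 is loopback
        return "A"
    if 128 <= first <= 191:
        return "B"
    if 192 <= first <= 223:
        return "C"
    return "other"            # Class D (multicast) and Class E (reserved)

print(address_class("10.1.2.3"))     # A
print(address_class("172.16.1.9"))   # B
print(address_class("192.168.1.1"))  # C
```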

The private networks

Some network numbers need to be kept off the publicly accessible Internet, so that anyone can use them on a private network without the risk of conflicting with another group or organization.

IANA and its peer body, the Internet Engineering Task Force (IETF), set aside several large blocks of network numbers for such networks, specifically:

10.0.0.0 (a single Class A network)
172.16.0.0 through 172.31.0.0 (16 contiguous Class B networks)
192.168.0.0 (256 contiguous Class C networks)

The zeros at the end of the addresses are a nomenclature that indicates the entire network. For example, 192.124.49.0 represents the Class C network 192.124.49.
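Modern standard libraries know these reserved blocks; for example, Python's ipaddress module can report whether an address falls in one of the private (RFC 1918) ranges:

```python
import ipaddress

# Check a few addresses against the reserved private blocks:
flags = {addr: ipaddress.ip_address(addr).is_private
         for addr in ["10.0.0.1", "172.16.1.9", "192.168.1.1", "192.124.49.1"]}
for addr, private in flags.items():
    print(addr, "private" if private else "public")
# The first three fall in private blocks; 192.124.49.1 is a public address.
```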

Networking is at the very heart of parallel Linux clusters. At a minimum, you will need to be able to address machines on a simple Network of Workstations (a "NoW") which looks like a local area network of machines. Advanced cluster configurations will include dozens of addresses configured in complex ways that create multiple private networks across which the data processed by your cluster will travel. These IP basics will also help you with the next topic, routing.


Routing

If you are not a networking guru and have ever wondered how all the bits you see on web sites, or the email you receive, get from wherever they are to wherever you are, it's all done through the magic of routing.

Routing is the process by which a specialized computer (called a router) with multiple network interfaces takes information (packets) that arrives on one interface and delivers it to another interface, given a set of rules about how information should be delivered between the networks.

The set of rules that defines how packets are delivered is called a routing table. As shown in Figure 2-11, it is simply a mapping table that the routing program looks at to determine where it should deliver packets that arrive from the networks to which it's connected.

Figure 2-11. Routing


In the old days of the ARPAnet, routers were just general purpose computers with multiple network connections and special software. Today, routers are specialized computers that run operating systems optimized to the task of routing packets.

Routing on networks usually is done in conjunction with routers. However, for use with clusters, we can use the network interface cards on our machines and some code in the Linux kernel to help us perform the routing function.
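A routing table is, at heart, a mapping from destination networks to interfaces. A minimal sketch in Python (the networks and interface names here are made up purely for illustration):

```python
import ipaddress

# A toy routing table mapping destination networks to outgoing interfaces,
# like the mapping table a router consults for each arriving packet:
routing_table = {
    ipaddress.ip_network("192.168.1.0/24"): "eth0",
    ipaddress.ip_network("192.168.2.0/24"): "eth1",
    ipaddress.ip_network("10.0.0.0/8"):     "eth2",
}

def route(addr, default="eth3"):
    """Pick the interface for a destination: the most specific
    (longest-prefix) matching network wins; otherwise the default route."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routing_table if ip in net]
    if not matches:
        return default
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(route("192.168.2.7"))   # eth1
print(route("10.9.8.7"))      # eth2
```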

Depending upon the configuration you choose for your cluster, you will add multiple network interfaces to your cluster and make use of various routing mechanisms to allow the nodes of your cluster to communicate over that cluster topology. A number of these topologies will be discussed in Chapter 3, Designing Clusters.

Parallel Programming Systems

Lastly, the thing that makes a collection of machines on a network into a parallel computer is software.

There are dozens of different packages that have been developed over the years, but the two we will focus on in this book are MPI and PVM.


MPI

MPI, the Message Passing Interface, is a standard specification for message-passing libraries. MPICH is a portable implementation of the full MPI specification for a wide variety of parallel computing environments, including workstation clusters and massively parallel processors (MPPs).

MPI is the basis of much of the clustering work being done today. It's the result of an industry consortium comprised of computer hardware vendors, software vendors, and users of parallel systems.

MPI is installed by default on clusters built with the CD-ROM that accompanies this book.


PVM

PVM, the Parallel Virtual Machine, is another package that is installed by default on clusters built with the CD-ROM that accompanies this book.

PVM is the result of parallel programming research that started at Oak Ridge National Labs. It is a software package that permits a heterogeneous collection of Unix computers hooked together by a network to be used as a single large parallel computer.


Summary

In this chapter, I explained the reason for the development of clusters, which, as with many things in life, boils down to money (or, in this case, the lack of it). I also touched upon the idea that clusters can be very powerful, sometimes even faster than their commercial counterparts (based on the rankings in the semi-annual "Top 500" supercomputer list). Yet clusters cost a fraction of the price of traditional supercomputers, and they can be made out of almost anything.

We explored the fact that clusters are so cost effective because they are really just computers that are networked in special ways. They run special software that allows them to solve problems that have been broken down into easy to compute chunks that run on the different nodes in the cluster.

This breaking down of large problems into smaller problems is also the basis of parallelism, which can be exploited at levels ranging from the intra-system level (CPUs, disk channels, etc.), to the software level (data/control dependencies, granularity of execution, and program optimization), up through the level of multiple systems connected by networks. At each level of parallelism, there is performance that can be squeezed out to make your programs run more effectively.

Finally, I covered the various kinds of hardware parallel processing schemes that you are likely to hear about, and introduced a number of networking basics such as IP addressing and various routing topologies that you will be seeing a lot more of in the coming chapters.

The next step will be to take all of the concepts and tools we have discussed here and see how to apply them to the design of an actual cluster.

1. The Top 500 Supercomputer List comes out several times per year and may be viewed at

2. Parallelism and the performance analysis of software is such a complex topic, I can't hope to cover it completely here. In fact, it's so deep that you could write a whole book on the topic. Fortunately, two people already did. Kevin Dowd and Charles Severance are the authors of O'Reilly's High Performance Computing. This book is a "must have" reference work for anyone truly wishing to explore high-performance programming. More information on this book can be found in Appendix A, Resources.

3. The common nomenclature of computer science is to call a computer of a particular type an "x machine" where x is a particular style of computer. For example, a Symmetric Multiprocessor is called an "SMP machine." Occasionally I will use the architecture's full name, and occasionally I will use the former designations for the sake of brevity as I discuss these various kinds of parallel computers.

4. In fact, because there were often so many competing interests involved in the creation of the Internet standards, the slogan of the main standards group for the Internet, the Internet Engineering Task Force (IETF), became "We reject: kings, religion, and presidents. We accept: working code." The result was standards that worked wherever implementors had the good sense to adhere to the IETF standard and resist introducing proprietary (read: incompatible) extensions.


© 2001, O'Reilly & Associates, Inc.