BUY THIS BOOK
Add to Cart

Print Book $34.99


Add to Cart

Print+PDF $45.49

Add to Cart

PDF $27.99

Safari Books Online

What is this?

Add to UK Cart

Print Book £21.99

What is this?

Looking to Reprint or License this content?


Intel Threading Building Blocks
Intel Threading Building Blocks Outfitting C++ for Multi-core Processor Parallelism By James Reinders

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Why Threading Building Blocks?
Intel Threading Building Blocks offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you leverage multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library; it represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability.
This chapter introduces Intel Threading Building Blocks and how it stands out relative to other options for C++ programmers. Although Threading Building Blocks relies on templates and the C++ concept of generic programming, this book does not require any prior experience with these concepts or with threading.
explains the challenges of parallelism and introduces key concepts that are important for using Threading Building Blocks. Together, these first two chapters set up the foundation of knowledge needed to make the best use of Threading Building Blocks.
Multi-core processors are becoming common, yet writing even a simple parallel_for loop is tedious with existing threading packages. Writing an efficient scalable program is much harder. Scalability embodies the concept that a program should see benefits in performance as the number of processor cores increases.
Threading Building Blocks helps you create applications that reap the benefits of new processors with more and more cores as they become available.
Threading Building Blocks is a library that supports scalable parallel programming using standard C++ code. It does not require special languages or compilers. The ability to use Threading Building Blocks on virtually any processor or any operating system with any C++ compiler makes it very appealing.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Overview
Multi-core processors are becoming common, yet writing even a simple parallel_for loop is tedious with existing threading packages. Writing an efficient scalable program is much harder. Scalability embodies the concept that a program should see benefits in performance as the number of processor cores increases.
Threading Building Blocks helps you create applications that reap the benefits of new processors with more and more cores as they become available.
Threading Building Blocks is a library that supports scalable parallel programming using standard C++ code. It does not require special languages or compilers. The ability to use Threading Building Blocks on virtually any processor or any operating system with any C++ compiler makes it very appealing.
Threading Building Blocks uses templates for common parallel iteration patterns, enabling programmers to attain increased speed from multiple processor cores without having to be experts in synchronization, load balancing, and cache optimization. Programs using Threading Building Blocks will run on systems with a single processor core, as well as on systems with multiple processor cores. Threading Building Blocks promotes scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components easily. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner. The result is that Threading Building Blocks enables you to specify parallelism far more conveniently, and with better results, than using raw threads.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Benefits
As mentioned, the goal of a programmer in a modern computing environment is scalability: to take advantage of both cores on a dual-core processor, all four cores on a quad-core processor, and so on. Threading Building Blocks makes writing scalable applications much easier than it is with traditional threading packages.
There are a variety of approaches to parallel programming, ranging from the use of platform-dependent threading primitives to exotic new languages. The advantage of Threading Building Blocks is that it works at a higher level than raw threads, yet does not require exotic languages or compilers. You can use it with any compiler supporting ISO C++. This library differs from typical threading packages in these ways:
Threading Building Blocks enables you to specify tasks instead of threads
Most threading packages require you to create, join, and manage threads. Programming directly in terms of threads can be tedious and can lead to inefficient programs because threads are low-level, heavy constructs that are close to the hardware. Direct programming with threads forces you to do the work to efficiently map logical tasks onto threads. In contrast, the Threading Building Blocks runtime library automatically schedules tasks onto threads in a way that makes efficient use of processor resources. The runtime is very effective at load balancing the many tasks you will be specifying.
By avoiding programming in a raw native thread model, you can expect better portability, easier programming, more understandable source code, and better performance and scalability in general.
Indeed, the alternative of using raw threads directly would amount to programming in the assembly language of parallel programming. It may give you maximum flexibility, but with many costs.
Threading Building Blocks targets threading for performance
Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Thinking Parallel
This chapter is about how to "Think Parallel." It is a brief introduction to the mental discipline that helps you make a program parallelizable in a safe and scalable manner. Even though Intel Threading Building Blocks does much of the work that traditional APIs require the programmer to do, this kind of thinking is a fundamental requirement to writing a good parallel program.
This is the place in the book that defines terms such as scalability and provides the motivation for focusing on these concepts. The topics covered in this chapter are decomposition, scaling, threads, correctness, abstraction, and patterns.
If you don't already approach every computer problem with parallelism in your thoughts, this chapter should be the start of a new way of thinking. If you are already deeply into parallel programming, this chapter can still show how Threading Building Blocks deals with the concepts.
How should we think about parallel programming? This is now a common question because, after decades in a world where most computers had only one central processing unit (CPU), we are now in a world where only "old" computers have one CPU. Multi-core processors are now the norm. Parallel computers are the norm. Therefore, every software developer needs to Think Parallel.
Today, when developers take on a programming job, they generally think about the best approach before programming. We already think about selecting the best algorithm, implementation language, and so on. Now, we need to think about the inherent parallelism in the job first. This will, in turn, drive algorithm and implementation choices. Trying to consider the parallelism after everything else is not Thinking Parallel, and will not work out well.
Threading Building Blocks was designed to make expressing parallelism much easier by abstracting away details and providing strong support for the best ways to program for parallelism. Here is a quick overview of how Threading Building Blocks addresses the topics we will define and review in this chapter:
Decomposition
Learning to decompose your problem into concurrent tasks (tasks that can run at the same time).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Elements of Thinking Parallel
Threading Building Blocks was designed to make expressing parallelism much easier by abstracting away details and providing strong support for the best ways to program for parallelism. Here is a quick overview of how Threading Building Blocks addresses the topics we will define and review in this chapter:
Decomposition
Learning to decompose your problem into concurrent tasks (tasks that can run at the same time).
scaling
Expressing a problem so that there are enough concurrent tasks to keep all the processor cores busy while minimizing the overhead of managing the parallel program.
Threads
A guide to the technology underlying the concurrency in programs—and how they are abstracted by Threading Building Blocks so that you can just focus on your tasks.
Correctness
How the implicit synchronization inherent in Threading Building Blocks helps minimize the use of locks. If you still must use locks, there are special features for using the Intel Thread Checker to find deadlocks and race conditions, which result from errors involving locks.
Abstraction and patterns
How to choose and utilize algorithms, from and .
Caches
A key consideration in improving performance. The Threading Build Blocks task scheduler is already tuned for caches.
Intuition
Thinking in terms of tasks that can run at the same time (concurrent tasks), data decomposed to minimize conflicts among tasks, and recursion.
In everyday life, we find ourselves thinking about parallelism. Here are a few examples:
Long lines
When you have to wait in a long line, you have undoubtedly wished there were multiple shorter (faster) lines, or multiple people at the front of the line helping serve customers more quickly. Grocery store checkout lines, lines to get train tickets, lines to buy coffee, and lines to buy books in a bookstore are examples.
Lots of repetitive work
When you have a big task to do, which many people could help with at the same time, you have undoubtedly wished for more people to help you. Moving all your possessions from an old dwelling to a new one, stuffing letters in envelopes for a mass mailing, and installing the same software on each new computer in your lab are examples.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Decomposition
When you think about your project, how do you find the parallelism?
At the highest level, parallelism exists either in the form of data on which to operate in parallel, or in the form of tasks to execute concurrently. And these forms are not mutually exclusive.
Data parallelism () is easy to picture. Take lots of data and apply the same transformation to each piece of the data. In , each letter in the data set is capitalized and becomes the corresponding uppercase letter. This simple example shows that given a data set and an operation that can be applied element by element, we can apply the same task concurrently to each element. Programmers writing code for supercomputers love this sort of problem and consider it so easy to do in parallel that it has been called embarrassingly parallel. A word of advice: if you have lots of data parallelism, do not be embarrassed—take advantage of it and be very happy. Consider it happy parallelism.
Figure : Data parallelism
Data parallelism is eventually limited by the amount of data you want to process, and your thoughts will then turn to task parallelism (). Task parallelism means lots of different, independent tasks that are linked by sharing the data they consume. This, too, can be embarrassingly parallel. uses as examples some mathematical operations that can each be applied to the same data set to compute values that are independent. In this case, the average value, the minimum value, the binary OR function, and the geometric mean of the data set are computed.
Figure : Task parallelism
Pure task parallelism is harder to find than pure data parallelism. Often, when you find task parallelism, it's a special kind referred to as pipelining. In this kind of algorithm, many independent tasks need to be applied to a stream of data. Each item is processed by stages as they pass through, as shown by the letter A in . A stream of data can be processed more quickly if you use a pipeline because different items can pass through different stages at the same time, as shown in . A pipeline can also be more sophisticated than other processes: it can reroute data or skip steps for chosen items. Automobile assembly lines are good examples of pipe-lines; materials flow through a pipeline and get a little work done at each step ().
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Scaling and Speedup
The scalability of a program is a measure of how much speedup the program gets as you add more and more processor cores. Speedup is the ratio of the time it takes to run a program without parallelism versus the time it runs in parallel. A speedup of 2X indicates that the parallel program runs in half the time of the sequential program. An example would be a sequential program that takes 34 seconds to run on a one-processor machine and 17 seconds to run on a quad-core machine.
As a goal, we would expect that our program running on two processor cores should run faster than the program running on one processor core. Likewise, running on four processor cores should be faster than running on two cores.
We say that a program does not scale beyond a certain point when adding more processor cores no longer results in additional speedup. When this point is reached, it is common for performance to fall if we force additional processor cores to be used. This is because the overhead of distributing and synchronizing begins to dominate. Threading Building Blocks has some algorithm templates which use the notion of a grain size to help limit the splitting of data to a reasonable level to avoid this problem. Grain size will be introduced and explained in detail in and .
As Thinking Parallel becomes intuitive, structuring problems to scale will become second nature.
The topic of how much parallelism there is in an application has gotten considerable debate, and the answer is "it depends."
It certainly depends on the size of the problem to be solved and on the ability to find a suitable algorithm to take advantage of the parallelism. Much of this debate previously has been centered on making sure we write efficient and worthy programs for expensive and rare parallel computers. The definition of size, the efficiency required, and the expense of the computer have all changed with the emergence of multi-core processors. We need to step back and be sure we review the ground we are standing on. The world has changed.

Amdahl's Law

Gene Amdahl, renowned computer architect, made observations regarding the maximum improvement to a computer system that can be expected when only a portion of the system is improved. His observations in 1967 have come to be known as
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is a Thread?
If you know what a thread is, skip ahead to the section "Safety in the Presence of Concurrency." It's important to be comfortable with the concept of a thread, even though the goal of Intel Threading Building Blocks is to abstract away thread management. Fundamentally, you will still be constructing a threaded program and you will need to understand the implications of this underlying implementation.
All modern operating systems are multitasking operating systems that typically use a preemptive scheduler. Multitasking means that more than one program can be active at a time. You may take it for granted that you can have an email program and a web browser program running at the same time. Yet, not that long ago, this was not the case.
A preemptive scheduler means the operating system puts a limit on how long one program can use a processor core before it is forced to let another program use it. This is how the operating system makes it appear that your email program and your web browser are running at the same time when only one processor core is actually doing the work.
Generally, each process (program) runs relatively independent of other processes. In particular, the memory where your program variables will reside is completely separate from the memory used by other processes. Your email program cannot directly assign a new value to a variable in the web browser program. If your email program can communicate with your web browser—for instance, to have it open a web page from a link you received in email—it does so with some form of communication that takes much more time than a memory access.
Breaking a problem into multiple processes and using only a restricted, mutually agreed-upon communication between them has a number of advantages. One of the advantages is that an error in one process will be less likely to interfere with other processes. Before multitasking operating systems, it was much more common for a single program to be able to crash the entire machine. Putting tasks into processes, and limiting interaction with other processes and the operating system, has greatly added to system stability.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Mutual Exclusion and Locks
You need to think about whether concurrent accesses to the same resources will occur in your program. The resource with which you will most often be concerned is data held in memory, but you also need to think about files and I/O of all kinds.
The best policy is to decompose your problem in such a way that synchronization is implicit instead of explicit. You can achieve this by breaking up the tasks so that they can work independently, and the only synchronization that occurs is waiting for all the tasks to be completed at the end.
Instead of locks, which are shown in , you can use a small set of operations that the system guarantees to be atomic. An atomic operation is equivalent to an instruction that cannot be interrupted.
When explicit synchronization and atomic operations are insufficient, locks needs to be used. covers the various options for mutual exclusion.
Consider a program with two threads. We start with X = 44. Thread A executes X = X + 10. Thread B executes X = X - 12. If we add locking () so that only Thread A or Thread B can execute its statement at a time, we end up with X = 42. If both threads try to obtain a lock at the same time, one will be excluded and will have to wait before the lock is granted. shows how long Thread B might have to wait if it requested the lock at the same time as Thread A but did not get the lock because Thread A had it first.
Figure : Predictable outcome due using mutual exclusion
Without the locks, a race condition exists and at least two more results are possible: X = 32 or X = 54. X = 42 can still occur as well (). Three results are now possible because each statement reads X, does a computation, and writes to X. Without locking, there is no guarantee that a thread reads the value of X before or after the other thread writes a value.
Figure : Results of race condition (no mutual exclusion)
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Correctness
The biggest challenge of learning to Think Parallel is understanding correctness as it relates to concurrency. Concurrency means you have multiple threads of control that are active at one time. The operating system is going to schedule those threads in a number of ways. Each time the program runs, the precise order of operations will potentially be different. Your challenge as a programmer is to make sure that every legitimate way the operations in your concurrent program can be ordered will still lead to the correct result. A high-level abstraction such as Threading Building Blocks helps a great deal, but there are a few issues you have to grapple with on your own: potential variations in results when programs compute results in parallel, and new types of programming bugs when locks are used incorrectly.
Computations done in parallel often get different results than the original sequential program. Round-off errors are the most common surprise for many programmers when a program is modified to run in parallel. You should expect numeric results to vary slightly when computations are changed to run in parallel. For example, computing (A+B+C+D)as ((A+B)+(C+D))enables A+B and C+D to be computed in parallel, but the final sum may be slightly different from other evaluations such as (((A+B)+C)+D). Even the parallel results can differ from run to run, depending on the order of the operations.
A few types of program failures can happen only in a parallel program because they involve the coordination of tasks. These failures are known as deadlocks and race conditions. Although Threading Building Blocks simplifies programming so as to reduce the chance for such failures, they are still quite possible. Multithreaded programs can be nondeterministic as a result, which means the same program with the same input can follow different execution paths each time it is invoked. When this occurs, failures do not repeat consistently and debugger intrusions can easily change the failure, thereby making debugging frustrating, to say the least.
Tracking down and eliminating the source of unwanted nondeterminism is not easy. Specialized tools such as the Intel Thread Checker help, but the first step is to understand these issues and try to avoid them.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Abstraction
When writing a program, choosing an appropriate level of abstraction is important. Few programmers use assembly language anymore. Programming languages such as C and C++ have abstracted away the low-level details. Hardly anyone misses the old programming method.
Parallelism is no different. You can easily get caught up in writing code that is too low-level. Raw thread programming requires you to manage threads, which is time-consuming and error-prone.
Programming in Threading Building Blocks offers an opportunity to avoid thread management. This will result in code that is easier to create, easier to maintain, and more elegant. However, it does require thinking of algorithms in terms of what work can be divided and how data can be divided.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Patterns
Mark Twain once observed, "The past does not repeat itself, but it does rhyme." And so it is with computer programs: code may not be reused over and over without change, but patterns do emerge.
Condensing years of parallel programming experience into patterns is not an exact science. However, we can explain parallel programming approaches in more detail than just talking about task versus data decomposition.
Object-oriented programming has found value in the Gang of Four (Gamma, Helm, Johnson, and Vlissides) and their landmark work, Design Patterns: Elements of Reusable Object-Oriented Software (Addison Wesley). Many credit that book with bringing more order to the world of object-oriented programming. The book gathered the collective wisdom of the community and boiled it down into simple "patterns" with names, so people could talk about them.
A more recent book, Patterns for Parallel Programming, by Mattson et al. (Addison Wesley), has similarly collected the wisdom of the parallel programming community. Its point is that the field of parallel programming has moved from chaos to established practice. Experts use common tricks and have their own language to talk about these tricks. With these patterns in mind, programmers can quickly get up to speed in this new field, just as object-oriented programmers have done for years with the famous Gang of Four book.
Patterns for Parallel Programming is a serious and detailed look at how to approach parallelism. Tim Mattson often lectures on this topic, and he helped me summarize how the patterns relate to Threading Building Blocks. This will help connect the concepts of this book with the patterns for parallelism that his book works to describe.
Mattson et al. propose that programmers need to work through four design spaces when moving from first thoughts to a parallel program ( summarizes these explanations):
Finding concurrency
This step was discussed earlier in this chapter, in the section "Decomposition." Threading Building Blocks simplifies finding concurrency by encouraging you to find one or more tasks without worrying about mapping them to hardware threads. For some algorithms (e.g.,
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Intuition
After reading this chapter, you should be able to explain Thinking Parallel in terms of decomposition, scaling, correctness, abstraction, and patterns.
Once you understand these five concepts, and you can juggle them in your head when considering a programming task, you are Thinking Parallel. You will be developing an intuition about parallelism that will serve you well. Already, programmers seek to develop a sense of when a problem should use a parser, call on a sort algorithm, involve a database, use object-oriented programming, and so on. We look for patterns, think about decomposition, understand abstractions, and anticipate what approaches will be easier to debug. Parallelism is no different in this respect.
Developing an intuition to Think Parallel requires nothing more than understanding this chapter and trying it on a few projects. Intel Threading Building Blocks helps with each type of thinking presented in the chapter. Combined with Think Parallel intuition, Threading Building Blocks simply helps with parallel programming.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Basic Algorithms
This is the key chapter in learning Intel Threading Building Blocks. Here you will come to understand the recursion, task-stealing, and algorithm templates that Threading Building Blocks uniquely combines.
The most visible contribution of Threading Building Blocks is the algorithm templates covered in this chapter and the next chapter. This chapter introduces the simplest loop-oriented algorithms based on recursive ranges, and the next chapter expands on that with more advanced algorithm support. Future chapters offer details that round out features needed to make your use of Threading Building Blocks complete.
Threading Building Blocks offers the following types of generic parallel algorithms, which are covered in this chapter:
Loop parallelization
parallel_for and parallel_reduce
Load-balanced, parallel execution of a fixed number of independent loop iterations
parallel_scan
A template function that computes a prefix computation (also known as a scan) in parallel (y[i]=y [i-1] op x[i])
The validity of the term Building Block is clearer when you see how fully Threading Building Blocks supports nested parallelism to enable you to build larger parallel components from smaller parallel components. A Quicksort example shown in , for instance, was implemented using parallel_for recursively. The recursion is implicit because of Threading Building Blocks' inherent concept of splitting, as embodied in the parallel iterator.
When you thoroughly understand why a recursive algorithm such as Quicksort should use the parallel_for template with a recursive range instead of using a recursion template, you will understand a great deal about how to apply Threading Building Blocks to your applications.
Understanding some fundamental concepts can make the parallelism model of Threading Building Blocks intuitive. Most fundamental is the reliance on breaking up problems recursively as required to get to the right level of parallel tasks. The proper degree of breakdown in a problem is embodied in a concept called grain size. Grain size started as a mysterious manual process, which has since been facilitated with some automation (heuristics) in the latest versions of Threading Building Blocks. This chapter offers rules of thumb, based on experience with Threading Building Blocks, for picking the best grain size.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Initializing and Terminating the Library
Intel Threading Building Blocks components are defined in the tbb namespace. For brevity's sake, the namespace is explicit in the first mention of a component in this book, but implicit afterward.
Any thread that uses an algorithm template from the library or the task scheduler must have an initialized tbb::task_scheduler_init object. A thread may have more than one of these objects initialized at a time. The task scheduler shuts down when all task_scheduler_init objects terminate. By default, the constructor for task_ scheduler_init does the initialization and the destructor does the termination. Thus, declaring a task_scheduler_init in main(), as in , both starts and shuts down the scheduler.
Example . Initializing the library
#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main() {
    task_scheduler_init init;
     ...
    return 0;
}
The using directive in the example enables you to use the library identifiers without having to write out the namespace prefix tbb before each identifier. The rest of the examples assume that such a using directive is present.
Automatic startup/shutdown was not implemented because, based on Intel's experience in implementing OpenMP, we knew that parts are too problematic on some operating systems to do it behind the scenes. In particular, always knowing when a thread shuts down can be quite problematic.
Calling the initialization more than once will not cause the program to fail, but it is a bit wasteful and can cause a flurry of extra warnings from some debugging or analysis tools, such as the Intel Thread Checker.
The section "Mixing with Other Threading Packages" in explains how to construct task_scheduler_init objects if your program creates threads itself using another interface.
The constructor for task_scheduler_init takes an optional parameter that specifies the number of desired threads, including the calling thread. The optional parameter can be one of the following:
  • The value task_scheduler_init::automatic, which is the default when the parameter is not specified. It exists for the sake of the method task_scheduler_ init::initialize
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Loop Parallelization
The simplest form of scalable parallelism is a loop of iterations that can each run simultaneously without interfering with each other. The following sections demonstrate how to parallelize such simple loops.
parallel_for and parallel_reduce
Load-balanced, parallel execution of a fixed number of independent loop iterations
parallel_scan
A template function that computes a parallel prefix (y[i] = y[i-1] op x[i])
Suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. shows the sequential code to do this.
Example . Original loop code
void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i>n; ++i )
        Foo(a[i]);
}
The iteration space here is of type size_t, and it goes from 0 to n-1. The template function tbb::parallel_for breaks this iteration space into chunks and runs each chunk on a separate thread.
The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk. The form is a Standard Template Library (STL)-style function object, called the body object, in which operator() processes a chunk. declares the body object.
Example . A class for use by a parallel_for
#include "tbb/blocked_range.h"

class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}
};
Note the iteration space argument to operator().A blocked_range<T> is a template class provided by the library. It describes a one-dimensional iteration space over type T. Class parallel_for works with other kinds of iteration spaces, too. The library provides blocked_range2d for two-dimensional spaces. A little later in this chapter, in the section "Advanced Topic: Other Kinds of Iteration Spaces," I will explain how you can define your own spaces.
An instance of ApplyFoo needs member fields that remember all the local variables that were defined outside the original loop but were used inside it. Usually, the constructor for the body object will initialize these fields, though
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Recursive Range Specifications
Most algorithms provided by the Threading Building Blocks library are generic and operate on all types that model the necessary concepts. Recursive ranges define the space for the algorithm to operate upon and therefore are important to understand.
Splittable Concept
Requirements for a type whose instances can be split into two pieces. lists the requirements for a splittable type X with instance x
Table : Splittable Concept
Pseudosignature
Semantics
X::X(X& x, split)
Split x into two parts, one reassigned to x and the other to the newly constructed object.
Description
A type is splittable if it has a splitting constructor that allows an instance to be split into two pieces. The splitting constructor takes as arguments a reference to the original object, and a dummy argument of type split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor. After the constructor runs, x and the newly constructed object should represent the two pieces of the original x. The library uses splitting constructors in two contexts:
  • Partitioning a range into two subranges that can be processed concurrently
  • Forking a body (function object) into two bodies that can run concurrently
The following model types provide examples.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Recursive Range Specifications
blocked_range and blocked_range2d represent splittable ranges. For each of these, splitting partitions the range into two subranges.
The bodies for parallel_reduce, parallel_scan, simple_partitioner, and auto_partitioner must be splittable. For each of these, splitting results in two bodies that can run concurrently.
split Class
Type for dummy argument of a splitting constructor.
#include "tbb/tbb_stddef.h"

class split;
Description
An argument of type split is used to distinguish a splitting constructor from a copy constructor.
Members
	namespace tbb {
	    class split {
	    };
	}
Range Concept
Requirements for a type representing a recursively divisible set of values. lists the requirements for a Range type R.
Description
Table : Range Concept
Pseudosignature
Semantics
R::R( const R& )
Copy constructor
R::~R()
Destructor
bool R::empty() const
True if range is empty
bool R::is_divisible() const
True if range can be partitioned into two subranges
R::R( R& r, split )
Split r into two subranges
Description
A Range can be recursively subdivided into two parts. It is recommended that the division be into nearly equal parts, but it is not required. Splitting as evenly as possible typically yields the best parallelism. Ideally, a range is recursively splittable until the parts represent portions of work that are more efficient to execute serially rather than split further
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Summary of Loops
The high-level loop templates in Intel Threading Building Blocks give you efficient, scalable ways to exploit the power of multi-core chips without having to start from scratch. They let you design your software at a concurrent task level and not worry about low-level manipulation of threads. Because they are generic, you can customize them to your specific needs. Although algorithms in this chapter can unlock the power of multi-core processing for many applications, sometimes you will require more complex algorithms. The next chapter takes the models shown in this chapter to a higher level.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Advanced Algorithms
Algorithm templates are the keys to using Intel Threading Building Blocks. This chapter presents some relatively complex algorithms that build on the foundation laid in , so you should understand before jumping into this chapter. This chapter covers Threading Building Blocks' support for the following types of generic parallel algorithms.
Parallel algorithms for streams:
parallel_while
Use for an unstructured stream or pile of work. Offers the ability to add additional work to the pile while running.
pipeline
Use when you have a linear sequence of stages. Specify the maximum number of items that can be in transit. Each stage can be serial or parallel. This uses the cache efficiently because each worker thread takes an item through as many stages as possible, and the algorithm is biased toward finishing old items before tackling new ones.
Parallel sort:
parallel_sort
A comparison sort with an average time complexity not to exceed O(n log n) on a single processor and approaching O(N) as more processors are used. When worker threads are available, parallel_sort creates subtasks that may be executed concurrently.
You can successfully parallelize many applications using only the constructs discussed thus far. However, some situations call for other parallel patterns. This section describes the support for some of these alternative patterns:
parallel_while
Use for an unstructured stream or pile of work. Offers the ability to add additional work to the pile while running.
pipeline
Use when you have a linear pipeline of stages. Specify the maximum number of items that can be in flight. Each stage can be serial or parallel. This uses the cache efficiently because each worker thread handles an item through as many stages as possible, and the algorithm is biased toward finishing old items before tackling new ones.
For some loops, the end of the iteration space is not known in advance, or the loop body may add more iterations to do before the loop exits. You can deal with both situations using the template class
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Parallel Algorithms for Streams
You can successfully parallelize many applications using only the constructs discussed thus far. However, some situations call for other parallel patterns. This section describes the support for some of these alternative patterns:
parallel_while
Use for an unstructured stream or pile of work. Offers the ability to add additional work to the pile while running.
pipeline
Use when you have a linear pipeline of stages. Specify the maximum number of items that can be in flight. Each stage can be serial or parallel. This uses the cache efficiently because each worker thread handles an item through as many stages as possible, and the algorithm is biased toward finishing old items before tackling new ones.
For some loops, the end of the iteration space is not known in advance, or the loop body may add more iterations to do before the loop exits. You can deal with both situations using the template class tbb::parallel_while.
A linked list is an example of an iteration space that is not known in advance. In parallel programming, it is usually better to use dynamic arrays instead of linked lists because accessing items in a linked list is inherently serial. But if you are limited to linked lists, if the items can be safely processed in parallel, and if processing each item takes at least a few thousand instructions, you can use parallel_while in a situation where the serial form is as shown in .
Example . Original list processing code
void SerialApplyFooToList( Item*root ) {
    for( Item* ptr=root; ptr!=NULL; ptr=ptr->next )
        Foo(pointer->data);
}
If Foo takes at least a few thousand instructions to run, you can get parallel speedup by converting the loop to use parallel_while. Unlike the templates described earlier, parallel_while is a class, not a function, and it requires two user-defined objects. The first object defines the stream of items. The object must have a method, pop_if_ present, such that when bool b =pop_if_present(v) is invoked, it sets v to the next iteration value if there is one and returns true. If there are no more iterations, it returns
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 5: Containers
Intel Threading Building Blocks provides highly concurrent containers that permit multiple threads to invoke a method simultaneously on the same container. At this time, a concurrent queue, vector, and hash map are provided. All of these highly concurrent containers can be used with this library, OpenMP, or raw threads.
Highly concurrent containers are very important because Standard Template Library (STL) containers generally are not concurrency-friendly, and attempts to modify them concurrently can easily corrupt the containers. As a result, it is standard practice to wrap a lock (mutex) around STL containers to make them safe for concurrent access, by letting only one thread operate on the container at a time. But that approach eliminates concurrency, and thus is not conducive to multi-core parallelism.
As much as possible, the interfaces of the Threading Building Blocks containers are similar to STL, but they do not match completely because some STL interfaces are inherently not thread-safe. The Threading Building Blocks containers provide fine-grained locking or lock-free implementations, and sometimes both.
Fine-grained locking
Multiple threads operate on the container by locking only those portions they really need to lock. As long as different threads access different portions, they can proceed concurrently.
Lock-free algorithms
Different threads proceed straight through the operation without locks, accounting for and correcting the effects of other interfering threads. There is inevitably some waiting at the end, but the contention over locks can be avoided during the operations.
Locks are worth avoiding because they limit concurrency, and mistakes create problems that are difficult to debug. Threading Building Blocks avoids the need for locks, but does not guarantee you freedom from locks.
Highly concurrent containers come at a cost. They typically have higher overhead than regular STL containers, and so operations on highly concurrent containers may take longer than for STL containers. Therefore, you should use highly concurrent containers when the speedup from the additional concurrency that they enable outweighs their slower sequential performance.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
concurrent_queue
The template class concurrent_queue<T> implements a concurrent queue with values of type T. Multiple threads may simultaneously push and pop elements from the queue.
In a single-threaded program, a queue is a first-in first-out structure. But if multiple threads are pushing and popping concurrently, the definition of first is uncertain. The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.
Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:
pop_if_present
This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.
pop
This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and not while(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.
Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size() is defined as the number of push operations started minus the number of pop operations started. If pops out-number pushes, size() becomes negative. For example, if a concurrent_queue is empty and there are n pending pop operations, size() returns –n. This provides an easy way for producers to know how many consumers are waiting on the queue. In particular, consumers may find the empty() method useful; it is defined to be true if and only if size() is not positive.
By default, a concurrent_queue<T> is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue. Bounded queues are slower than unbounded queues, so if there is a constraint elsewhere in your program that prevents the queue from becoming too large, it is better not to set the capacity.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
concurrent_vector
A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it. However, be careful not to let another task access an element that is under construction or is otherwise being modified. For safe concurrent growing, concurrent_vector has two methods for resizing that support common uses of dynamic arrays: grow_by and grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables you to safely append n consecutive elements to a vector, and returns the index of the first appended element. Each element is initialized with T(). So for instance, safely appends a C string to a shared vector.
Example . Concurrent vector
void Append( concurrent_vector<char>& vector, const char* string ) {
    size_t n = strlen(string)+1;
    memcpy( &vector[vector.grow_by(n)], string, n+1 );
}
The related method grow_to_at_least(n) grows a vector to size n if it is shorter. Con-current calls to grow_by and grow_to_at_least do not necessarily return in the order that elements are appended to the vector.
The size() method returns the number of elements in the vector, which may include elements that are still undergoing concurrent construction by grow_by and grow_to_ at_least. Also, it is safe to use iterators while the concurrent_vector is being grown, as long as the iterators never go past the current value of end(). However, the iterator may reference an element undergoing concurrent construction. You must synchronize construction and access of an element.
A concurrent_vector<T> never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code. However, concurrent_vector does have more overhead than std::vector. Use concurrent_vector only if you really need to dynamically resize it while other accesses are (or might be) in flight, or if you require that an element never move.
Operations on concurrent_vector are concurrency-safe with respect to growing, but are not safe for clearing or destroying a vector. Never invoke
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
concurrent_vector
Content preview·Buy PDF of this chapter|