Book description
Complex calculations, like training deep learning models or running large-scale simulations, can take an extremely long time. Efficient parallel programming can save hours—or even days—of computing time. Parallel and High Performance Computing shows you how to deliver faster run times, greater scalability, and increased energy efficiency to your programs by mastering parallel techniques for multicore processor and GPU hardware.
About the Technology
Write fast, powerful, energy-efficient programs that scale to tackle huge volumes of data. Using parallel programming, your code spreads data processing tasks across multiple CPUs for radically better performance. With a little help, you can create software that maximizes both speed and efficiency.
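To make the idea concrete, here is a minimal sketch of loop-level CPU parallelism with OpenMP in C, the kind of technique the book teaches (a generic illustration written for this description, not an excerpt from the book):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int n = 100000000;
        double sum = 0.0;

        /* The pragma splits the loop iterations across CPU cores;
           the reduction clause combines each thread's partial sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += 1.0 / ((double)i + 1.0);
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with OpenMP enabled (for example, gcc -fopenmp; the exact flag varies by compiler) and the loop iterations are shared across all available cores.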
About the Book
Parallel and High Performance Computing offers techniques guaranteed to boost your code’s effectiveness. You’ll learn to evaluate hardware architectures and work with industry standard tools such as OpenMP and MPI. You’ll master the data structures and algorithms best suited for high performance computing and learn techniques that save energy on handheld devices. You’ll even run a massive tsunami simulation across a bank of GPUs.
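For readers who have not used MPI before, the distributed-memory counterpart looks like the sketch below (again a generic, standard-MPI illustration, not one of the book's worked examples): every process computes a local value, and a single collective call combines the results.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process contributes one value; MPI_Reduce sums the
           contributions from all processes onto rank 0. */
        int local = rank, global = 0;
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", nprocs - 1, global);

        MPI_Finalize();
        return 0;
    }

Build with mpicc and launch with a runner such as mpirun -n 4 (launcher names vary by MPI distribution); the processes then cooperate on the reduction.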
What's Inside
- Planning a new parallel project
- Understanding differences in CPU and GPU architecture
- Addressing underperforming kernels and loops
- Managing applications with batch scheduling
About the Reader
For experienced programmers proficient with a high-performance computing language like C, C++, or Fortran.
About the Authors
Robert Robey works at Los Alamos National Laboratory and has been active in the field of parallel computing for over 30 years. Yuliana Zamora is currently a PhD student and Siebel Scholar at the University of Chicago, and has lectured on programming modern hardware at numerous national conferences.
Quotes
If you want to learn about parallel programming and high-performance computing based on practical and working examples, this book is for you.
- Tuan A. Tran, ExxonMobil
A great survey of recent advances on parallel and multi-processor software techniques.
- Albert Choy, OSI Digital Grid Solutions
An in-depth treatise on parallel computing from both a software- and hardware-optimized standpoint.
- Jean François Morin, Laval University
This book will show you how to design code that takes advantage of all the computing power modern computers offer.
- Alessandro Campeis, Vimar
Table of contents
- Parallel and High Performance Computing
- Copyright
- Dedication
- contents
- front matter
- Part 1 Introduction to parallel computing
- 1 Why parallel computing?
- 1.1 Why should you learn about parallel computing?
- 1.2 The fundamental laws of parallel computing
- 1.3 How does parallel computing work?
- 1.4 Categorizing parallel approaches
- 1.5 Parallel strategies
- 1.6 Parallel speedup versus comparative speedups: Two different measures
- 1.7 What will you learn in this book?
- Summary
- 2 Planning for parallelization
- 3 Performance limits and profiling
- 4 Data design and performance models
- 5 Parallel algorithms and patterns
- 5.1 Algorithm analysis for parallel computing applications
- 5.2 Performance models versus algorithmic complexity
- 5.3 Parallel algorithms: What are they?
- 5.4 What is a hash function?
- 5.5 Spatial hashing: A highly parallel algorithm
- 5.6 Prefix sum (scan) pattern and its importance in parallel computing
- 5.7 Parallel global sum: Addressing the problem of associativity
- 5.8 Future of parallel algorithm research
- 5.9 Further explorations
- Summary
- Part 2 CPU: The parallel workhorse
- 6 Vectorization: FLOPs for free
- 6.1 Vectorization and single instruction, multiple data (SIMD) overview
- 6.2 Hardware trends for vectorization
- 6.3 Vectorization methods
- 6.3.1 Optimized libraries provide performance for little effort
- 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
- 6.3.3 Teaching the compiler through hints: Pragmas and directives
- 6.3.4 Crappy loops, we got them: Use vector intrinsics
- 6.3.5 Not for the faint of heart: Using assembler code for vectorization
- 6.4 Programming style for better vectorization
- 6.5 Compiler flags relevant for vectorization for various compilers
- 6.6 OpenMP SIMD directives for better portability
- 6.7 Further explorations
- Summary
- 7 OpenMP that performs
- 7.1 OpenMP introduction
- 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
- 7.3 Examples of standard loop-level OpenMP
- 7.4 Variable scope importance for correctness in OpenMP
- 7.5 Function-level OpenMP: Making a whole function thread parallel
- 7.6 Improving parallel scalability with high-level OpenMP
- 7.7 Hybrid threading and vectorization with OpenMP
- 7.8 Advanced examples using OpenMP
- 7.9 Threading tools essential for robust implementations
- 7.10 Example of a task-based support algorithm
- 7.11 Further explorations
- Summary
- 8 MPI: The parallel backbone
- 8.1 The basics for an MPI program
- 8.2 The send and receive commands for process-to-process communication
- 8.3 Collective communication: A powerful component of MPI
- 8.4 Data parallel examples
- 8.5 Advanced MPI functionality to simplify code and enable optimizations
- 8.6 Hybrid MPI plus OpenMP for extreme scalability
- 8.7 Further explorations
- Summary
- Part 3 GPUs: Built to accelerate
- 9 GPU architectures and concepts
- 9.1 The CPU-GPU system as an accelerated computational platform
- 9.2 The GPU and the thread engine
- 9.3 Characteristics of GPU memory spaces
- 9.4 The PCI bus: CPU to GPU data transfer overhead
- 9.5 Multi-GPU platforms and MPI
- 9.6 Potential benefits of GPU-accelerated platforms
- 9.7 When to use GPUs
- 9.8 Further explorations
- Summary
- 10 GPU programming model
- 10.1 GPU programming abstractions: A common framework
- 10.1.1 Massive parallelism
- 10.1.2 Inability to coordinate among tasks
- 10.1.3 Terminology for GPU parallelism
- 10.1.4 Data decomposition into independent units of work: An NDRange or grid
- 10.1.5 Work groups provide a right-sized chunk of work
- 10.1.6 Subgroups, warps, or wavefronts execute in lockstep
- 10.1.7 Work item: The basic unit of operation
- 10.1.8 SIMD or vector hardware
- 10.2 The code structure for the GPU programming model
- 10.3 Optimizing GPU resource usage
- 10.4 Reduction pattern requires synchronization across work groups
- 10.5 Asynchronous computing through queues (streams)
- 10.6 Developing a plan to parallelize an application for GPUs
- 10.7 Further explorations
- Summary
- 11 Directive-based GPU programming
- 11.1 Process to apply directives and pragmas for a GPU implementation
- 11.2 OpenACC: The easiest way to run on your GPU
- 11.2.1 Compiling OpenACC code
- 11.2.2 Parallel compute regions in OpenACC for accelerating computations
- 11.2.3 Using directives to reduce data movement between the CPU and the GPU
- 11.2.4 Optimizing the GPU kernels
- 11.2.5 Summary of performance results for the stream triad
- 11.2.6 Advanced OpenACC techniques
- 11.3 OpenMP: The heavyweight champ enters the world of accelerators
- 11.4 Further explorations
- Summary
- 12 GPU languages: Getting down to basics
- 12.1 Features of a native GPU programming language
- 12.2 CUDA and HIP GPU languages: The low-level performance option
- 12.3 OpenCL for a portable open source GPU language
- 12.3.1 Writing and building your first OpenCL application
- 12.3.2 Reductions in OpenCL
- 12.4 SYCL: An experimental C++ implementation goes mainstream
- 12.5 Higher-level languages for performance portability
- 12.6 Further explorations
- Summary
- 13 GPU profiling and tools
- 13.1 An overview of profiling tools
- 13.2 How to select a good workflow
- 13.3 Example problem: Shallow water simulation
- 13.4 A sample of a profiling workflow
- 13.4.1 Run the shallow water application
- 13.4.2 Profile the CPU code to develop a plan of action
- 13.4.3 Add OpenACC compute directives to begin the implementation step
- 13.4.4 Add data movement directives
- 13.4.5 Guided analysis can give you some suggested improvements
- 13.4.6 The NVIDIA Nsight suite of tools can be a powerful development aid
- 13.4.7 CodeXL for the AMD GPU ecosystem
- 13.5 Don’t get lost in the swamp: Focus on the important metrics
- 13.6 Containers and virtual machines provide alternate workflows
- 13.7 Cloud options: A flexible and portable capability
- 13.8 Further explorations
- Summary
- Part 4 High performance computing ecosystems
- 14 Affinity: Truce with the kernel
- 14.1 Why is affinity important?
- 14.2 Discovering your architecture
- 14.3 Thread affinity with OpenMP
- 14.4 Process affinity with MPI
- 14.5 Affinity for MPI plus OpenMP
- 14.6 Controlling affinity from the command line
- 14.7 The future: Setting and changing affinity at run time
- 14.8 Further explorations
- Summary
- 15 Batch schedulers: Bringing order to chaos
- 16 File operations for a parallel world
- 16.1 The components of a high-performance filesystem
- 16.2 Standard file operations: A parallel-to-serial interface
- 16.3 MPI file operations (MPI-IO) for a more parallel world
- 16.4 HDF5 is self-describing for better data management
- 16.5 Other parallel file software packages
- 16.6 Parallel filesystem: The hardware interface
- 16.7 Further explorations
- Summary
- 17 Tools and resources for better code
- 17.1 Version control systems: It all begins here
- 17.2 Timer routines for tracking code performance
- 17.3 Profilers: You can’t improve what you don’t measure
- 17.4 Benchmarks and mini-apps: A window into system performance
- 17.5 Detecting (and fixing) memory errors for a robust application
- 17.5.1 Valgrind Memcheck: The open source standby
- 17.5.2 Dr. Memory for your memory ailments
- 17.5.3 Commercial memory tools for demanding applications
- 17.5.4 Compiler-based memory tools for convenience
- 17.5.5 Fence-post checkers detect out-of-bounds memory accesses
- 17.5.6 GPU memory tools for robust GPU applications
- 17.6 Thread checkers for detecting race conditions
- 17.7 Bug-busters: Debuggers to exterminate those bugs
- 17.8 Profiling those file operations
- 17.9 Package managers: Your personal system administrator
- 17.10 Modules: Loading specialized toolchains
- 17.11 Reflections and exercises
- Summary
- appendix A References
- appendix B Solutions to exercises
- appendix C Glossary
- index
Product information
- Title: Parallel and High Performance Computing
- Author(s): Robert Robey, Yuliana Zamora
- Release date: July 2021
- Publisher(s): Manning Publications
- ISBN: 9781617296468