Parallel and High Performance Computing

Book description

Complex calculations, like training deep learning models or running large-scale simulations, can take an extremely long time. Efficient parallel programming can save hours—or even days—of computing time. Parallel and High Performance Computing shows you how to deliver faster run-times, greater scalability, and increased energy efficiency to your programs by mastering parallel techniques for multicore processor and GPU hardware.

About the Technology
Write fast, powerful, energy-efficient programs that scale to tackle huge volumes of data. Using parallel programming, your code spreads data processing tasks across multiple CPUs for radically better performance. With a little help, you can create software that maximizes both speed and efficiency.
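
To make that concrete, here is a minimal C sketch (illustrative only, not an excerpt from the book) in which a single loop-level OpenMP directive spreads a summation across every available core. It assumes a compiler with OpenMP support, for example gcc -fopenmp.

  #include <stdio.h>
  #include <omp.h>

  /* Minimal sketch: one OpenMP directive spreads the loop iterations
     across the available CPU cores; reduction(+:sum) gives each thread
     its own partial sum and combines the partials safely at the end. */
  int main(void) {
      const int n = 1000000;
      double sum = 0.0;

      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++) {
          sum += (double)i;
      }

      printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
      return 0;
  }

The reduction clause is what keeps the parallel version correct: without it, every thread would update sum at the same time and race against the others.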

About the Book
Parallel and High Performance Computing offers techniques guaranteed to boost your code’s effectiveness. You’ll learn to evaluate hardware architectures and work with industry-standard tools such as OpenMP and MPI. You’ll master the data structures and algorithms best suited for high performance computing and learn techniques that save energy on handheld devices. You’ll even run a massive tsunami simulation across a bank of GPUs.
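
For a flavor of the MPI side, the short C sketch below (again illustrative rather than an excerpt) starts several cooperating processes and combines one value from each of them with a single MPI_Reduce call. It assumes an MPI installation that provides the mpicc compiler wrapper and the mpirun launcher.

  #include <stdio.h>
  #include <mpi.h>

  /* Minimal sketch: every MPI process (rank) contributes one value and
     MPI_Reduce sums the contributions onto rank 0. */
  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      double local = (double)rank;   /* each process contributes its rank */
      double total = 0.0;
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("Sum of ranks 0..%d is %g\n", nprocs - 1, total);

      MPI_Finalize();
      return 0;
  }

Built with mpicc and launched with, for example, mpirun -n 4 ./a.out, rank 0 prints the sum 6 (that is, 0 + 1 + 2 + 3).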

What's Inside
  • Planning a new parallel project
  • Understanding differences in CPU and GPU architecture
  • Addressing underperforming kernels and loops
  • Managing applications with batch scheduling


About the Reader
For experienced programmers proficient with a high-performance computing language like C, C++, or Fortran.

About the Authors
Robert Robey works at Los Alamos National Laboratory and has been active in the field of parallel computing for over 30 years. Yuliana Zamora is currently a PhD student and Siebel Scholar at the University of Chicago, and has lectured on programming modern hardware at numerous national conferences.

Quotes
If you want to learn about parallel programming and high-performance computing based on practical and working examples, this book is for you.
- Tuan A. Tran, ExxonMobil

A great survey of recent advances in parallel and multi-processor software techniques.
- Albert Choy, OSI Digital Grid Solutions

An in-depth treatise on parallel computing from both a software- and hardware-optimized standpoint.
- Jean François Morin, Laval University

This book will show you how to design code that takes advantage of all the computing power modern computers offer.
- Alessandro Campeis, Vimar

Table of contents

  1. Parallel and High Performance Computing
  2. Copyright
  3. Dedication
  4. contents
  5. front matter
    1. foreword
    2. How we came to write this book
    3. acknowledgments
    4. about this book
    5. Who should read this book
  6. Part 1 Introduction to parallel computing
  7. 1 Why parallel computing?
    1. 1.1 Why should you learn about parallel computing?
      1. 1.1.1 What are the potential benefits of parallel computing?
      2. 1.1.2 Parallel computing cautions
    2. 1.2 The fundamental laws of parallel computing
      1. 1.2.1 The limit to parallel computing: Amdahl’s Law
      2. 1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
    3. 1.3 How does parallel computing work?
      1. 1.3.1 Walking through a sample application
      2. 1.3.2 A hardware model for today’s heterogeneous parallel systems
      3. 1.3.3 The application/software model for today’s heterogeneous parallel systems
    4. 1.4 Categorizing parallel approaches
    5. 1.5 Parallel strategies
    6. 1.6 Parallel speedup versus comparative speedups: Two different measures
    7. 1.7 What will you learn in this book?
      1. 1.7.1 Additional reading
      2. 1.7.2 Exercises
    8. Summary
  8. 2 Planning for parallelization
    1. 2.1 Approaching a new project: The preparation
      1. 2.1.1 Version control: Creating a safety vault for your parallel code
      2. 2.1.2 Test suites: The first step to creating a robust, reliable application
      3. 2.1.3 Finding and fixing memory issues
      4. 2.1.4 Improving code portability
    2. 2.2 Profiling: Probing the gap between system capabilities and application performance
    3. 2.3 Planning: A foundation for success
      1. 2.3.1 Exploring with benchmarks and mini-apps
      2. 2.3.2 Design of the core data structures and code modularity
      3. 2.3.3 Algorithms: Redesign for parallel
    4. 2.4 Implementation: Where it all happens
    5. 2.5 Commit: Wrapping it up with quality
    6. 2.6 Further explorations
      1. 2.6.1 Additional reading
      2. 2.6.2 Exercises
    7. Summary
  9. 3 Performance limits and profiling
    1. 3.1 Know your application’s potential performance limits
    2. 3.2 Determine your hardware capabilities: Benchmarking
      1. 3.2.1 Tools for gathering system characteristics
      2. 3.2.2 Calculating theoretical maximum flops
      3. 3.2.3 The memory hierarchy and theoretical memory bandwidth
      4. 3.2.4 Empirical measurement of bandwidth and flops
      5. 3.2.5 Calculating the machine balance between flops and bandwidth
    3. 3.3 Characterizing your application: Profiling
      1. 3.3.1 Profiling tools
      2. 3.3.2 Empirical measurement of processor clock frequency and energy consumption
      3. 3.3.3 Tracking memory during run time
    4. 3.4 Further explorations
      1. 3.4.1 Additional reading
      2. 3.4.2 Exercises
    5. Summary
  10. 4 Data design and performance models
    1. 4.1 Performance data structures: Data-oriented design
      1. 4.1.1 Multidimensional arrays
      2. 4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
      3. 4.1.3 Array of Structures of Arrays (AoSoA)
    2. 4.2 Three Cs of cache misses: Compulsory, capacity, conflict
    3. 4.3 Simple performance models: A case study
      1. 4.3.1 Full matrix data representations
      2. 4.3.2 Compressed sparse storage representations
    4. 4.4 Advanced performance models
    5. 4.5 Network messages
    6. 4.6 Further explorations
      1. 4.6.1 Additional reading
      2. 4.6.2 Exercises
    7. Summary
  11. 5 Parallel algorithms and patterns
    1. 5.1 Algorithm analysis for parallel computing applications
    2. 5.2 Performance models versus algorithmic complexity
    3. 5.3 Parallel algorithms: What are they?
    4. 5.4 What is a hash function?
    5. 5.5 Spatial hashing: A highly-parallel algorithm
      1. 5.5.1 Using perfect hashing for spatial mesh operations
      2. 5.5.2 Using compact hashing for spatial mesh operations
    6. 5.6 Prefix sum (scan) pattern and its importance in parallel computing
      1. 5.6.1 Step-efficient parallel scan operation
      2. 5.6.2 Work-efficient parallel scan operation
      3. 5.6.3 Parallel scan operations for large arrays
    7. 5.7 Parallel global sum: Addressing the problem of associativity
    8. 5.8 Future of parallel algorithm research
    9. 5.9 Further explorations
      1. 5.9.1 Additional reading
      2. 5.9.2 Exercises
    10. Summary
  12. Part 2 CPU: The parallel workhorse
  13. 6 Vectorization: FLOPs for free
    1. 6.1 Vectorization and single instruction, multiple data (SIMD) overview
    2. 6.2 Hardware trends for vectorization
    3. 6.3 Vectorization methods
      1. 6.3.1 Optimized libraries provide performance for little effort
      2. 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
      3. 6.3.3 Teaching the compiler through hints: Pragmas and directives
      4. 6.3.4 Crappy loops, we got them: Use vector intrinsics
      5. 6.3.5 Not for the faint of heart: Using assembler code for vectorization
    4. 6.4 Programming style for better vectorization
    5. 6.5 Compiler flags relevant for vectorization for various compilers
    6. 6.6 OpenMP SIMD directives for better portability
    7. 6.7 Further explorations
      1. 6.7.1 Additional reading
      2. 6.7.2 Exercises
    8. Summary
  14. 7 OpenMP that performs
    1. 7.1 OpenMP introduction
      1. 7.1.1 OpenMP concepts
      2. 7.1.2 A simple OpenMP program
    2. 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
      1. 7.2.1 Loop-level OpenMP for quick parallelization
      2. 7.2.2 High-level OpenMP for better parallel performance
      3. 7.2.3 MPI plus OpenMP for extreme scalability
    3. 7.3 Examples of standard loop-level OpenMP
      1. 7.3.1 Loop-level OpenMP: Vector addition example
      2. 7.3.2 Stream triad example
      3. 7.3.3 Loop-level OpenMP: Stencil example
      4. 7.3.4 Performance of loop-level examples
      5. 7.3.5 Reduction example of a global sum using OpenMP threading
      6. 7.3.6 Potential loop-level OpenMP issues
    4. 7.4 Variable scope importance for correctness in OpenMP
    5. 7.5 Function-level OpenMP: Making a whole function thread parallel
    6. 7.6 Improving parallel scalability with high-level OpenMP
      1. 7.6.1 How to implement high-level OpenMP
      2. 7.6.2 Example of implementing high-level OpenMP
    7. 7.7 Hybrid threading and vectorization with OpenMP
    8. 7.8 Advanced examples using OpenMP
      1. 7.8.1 Stencil example with a separate pass for the x and y directions
      2. 7.8.2 Kahan summation implementation with OpenMP threading
      3. 7.8.3 Threaded implementation of the prefix scan algorithm
    9. 7.9 Threading tools essential for robust implementations
      1. 7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
      2. 7.9.2 Finding your thread race conditions with Intel® Inspector
    10. 7.10 Example of a task-based support algorithm
    11. 7.11 Further explorations
      1. 7.11.1 Additional reading
      2. 7.11.2 Exercises
    12. Summary
  15. 8 MPI: The parallel backbone
    1. 8.1 The basics for an MPI program
      1. 8.1.1 Basic MPI function calls for every MPI program
      2. 8.1.2 Compiler wrappers for simpler MPI programs
      3. 8.1.3 Using parallel startup commands
      4. 8.1.4 Minimum working example of an MPI program
    2. 8.2 The send and receive commands for process-to-process communication
    3. 8.3 Collective communication: A powerful component of MPI
      1. 8.3.1 Using a barrier to synchronize timers
      2. 8.3.2 Using the broadcast to handle small file input
      3. 8.3.3 Using a reduction to get a single value from across all processes
      4. 8.3.4 Using gather to put order in debug printouts
      5. 8.3.5 Using scatter and gather to send data out to processes for work
    4. 8.4 Data parallel examples
      1. 8.4.1 Stream triad to measure bandwidth on the node
      2. 8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
      3. 8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
    5. 8.5 Advanced MPI functionality to simplify code and enable optimizations
      1. 8.5.1 Using custom MPI data types for performance and code simplification
      2. 8.5.2 Cartesian topology support in MPI
      3. 8.5.3 Performance tests of ghost cell exchange variants
    6. 8.6 Hybrid MPI plus OpenMP for extreme scalability
      1. 8.6.1 The benefits of hybrid MPI plus OpenMP
      2. 8.6.2 MPI plus OpenMP example
    7. 8.7 Further explorations
      1. 8.7.1 Additional reading
      2. 8.7.2 Exercises
    8. Summary
  16. Part 3 GPUs: Built to accelerate
  17. 9 GPU architectures and concepts
    1. 9.1 The CPU-GPU system as an accelerated computational platform
      1. 9.1.1 Integrated GPUs: An underused option on commodity-based systems
      2. 9.1.2 Dedicated GPUs: The workhorse option
    2. 9.2 The GPU and the thread engine
      1. 9.2.1 The compute unit is the streaming multiprocessor (or subslice)
      2. 9.2.2 Processing elements are the individual processors
      3. 9.2.3 Multiple data operations by each processing element
      4. 9.2.4 Calculating the peak theoretical flops for some leading GPUs
    3. 9.3 Characteristics of GPU memory spaces
      1. 9.3.1 Calculating theoretical peak memory bandwidth
      2. 9.3.2 Measuring the GPU stream benchmark
      3. 9.3.3 Roofline performance model for GPUs
      4. 9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
    4. 9.4 The PCI bus: CPU to GPU data transfer overhead
      1. 9.4.1 Theoretical bandwidth of the PCI bus
      2. 9.4.2 A benchmark application for PCI bandwidth
    5. 9.5 Multi-GPU platforms and MPI
      1. 9.5.1 Optimizing the data movement between GPUs across the network
      2. 9.5.2 A higher performance alternative to the PCI bus
    6. 9.6 Potential benefits of GPU-accelerated platforms
      1. 9.6.1 Reducing time-to-solution
      2. 9.6.2 Reducing energy use with GPUs
      3. 9.6.3 Reduction in cloud computing costs with GPUs
    7. 9.7 When to use GPUs
    8. 9.8 Further explorations
      1. 9.8.1 Additional reading
      2. 9.8.2 Exercises
    9. Summary
  18. 10 GPU programming model
    1. 10.1 GPU programming abstractions: A common framework
      1. 10.1.1 Massive parallelism
      2. 10.1.2 Inability to coordinate among tasks
      3. 10.1.3 Terminology for GPU parallelism
      4. 10.1.4 Data decomposition into independent units of work: An NDRange or grid
      5. 10.1.5 Work groups provide a right-sized chunk of work
      6. 10.1.6 Subgroups, warps, or wavefronts execute in lockstep
      7. 10.1.7 Work item: The basic unit of operation
      8. 10.1.8 SIMD or vector hardware
    2. 10.2 The code structure for the GPU programming model
      1. 10.2.1 “Me” programming: The concept of a parallel kernel
      2. 10.2.2 Thread indices: Mapping the local tile to the global world
      3. 10.2.3 Index sets
      4. 10.2.4 How to address memory resources in your GPU programming model
    3. 10.3 Optimizing GPU resource usage
      1. 10.3.1 How many registers does my kernel use?
      2. 10.3.2 Occupancy: Making more work available for work group scheduling
    4. 10.4 Reduction pattern requires synchronization across work groups
    5. 10.5 Asynchronous computing through queues (streams)
    6. 10.6 Developing a plan to parallelize an application for GPUs
      1. 10.6.1 Case 1: 3D atmospheric simulation
      2. 10.6.2 Case 2: Unstructured mesh application
    7. 10.7 Further explorations
      1. 10.7.1 Additional reading
      2. 10.7.2 Exercises
    8. Summary
  19. 11 Directive-based GPU programming
    1. 11.1 Process to apply directives and pragmas for a GPU implementation
    2. 11.2 OpenACC: The easiest way to run on your GPU
      1. 11.2.1 Compiling OpenACC code
      2. 11.2.2 Parallel compute regions in OpenACC for accelerating computations
      3. 11.2.3 Using directives to reduce data movement between the CPU and the GPU
      4. 11.2.4 Optimizing the GPU kernels
      5. 11.2.5 Summary of performance results for the stream triad
      6. 11.2.6 Advanced OpenACC techniques
    3. 11.3 OpenMP: The heavyweight champ enters the world of accelerators
      1. 11.3.1 Compiling OpenMP code
      2. 11.3.2 Generating parallel work on the GPU with OpenMP
      3. 11.3.3 Creating data regions to control data movement to the GPU with OpenMP
      4. 11.3.4 Optimizing OpenMP for GPUs
      5. 11.3.5 Advanced OpenMP for GPUs
    4. 11.4 Further explorations
      1. 11.4.1 Additional reading
      2. 11.4.2 Exercises
    5. Summary
  20. 12 GPU languages: Getting down to basics
    1. 12.1 Features of a native GPU programming language
    2. 12.2 CUDA and HIP GPU languages: The low-level performance option
      1. 12.2.1 Writing and building your first CUDA application
      2. 12.2.2 A reduction kernel in CUDA: Life gets complicated
      3. 12.2.3 Hipifying the CUDA code
    3. 12.3 OpenCL for a portable open source GPU language
      1. 12.3.1 Writing and building your first OpenCL application
      2. 12.3.2 Reductions in OpenCL
    4. 12.4 SYCL: An experimental C++ implementation goes mainstream
    5. 12.5 Higher-level languages for performance portability
      1. 12.5.1 Kokkos: A performance portability ecosystem
      2. 12.5.2 RAJA for a more adaptable performance portability layer
    6. 12.6 Further explorations
      1. 12.6.1 Additional reading
      2. 12.6.2 Exercises
    7. Summary
  21. 13 GPU profiling and tools
    1. 13.1 An overview of profiling tools
    2. 13.2 How to select a good workflow
    3. 13.3 Example problem: Shallow water simulation
    4. 13.4 A sample of a profiling workflow
      1. 13.4.1 Run the shallow water application
      2. 13.4.2 Profile the CPU code to develop a plan of action
      3. 13.4.3 Add OpenACC compute directives to begin the implementation step
      4. 13.4.4 Add data movement directives
      5. 13.4.5 Guided analysis can give you some suggested improvements
      6. 13.4.6 The NVIDIA Nsight suite of tools can be a powerful development aid
      7. 13.4.7 CodeXL for the AMD GPU ecosystem
    5. 13.5 Don’t get lost in the swamp: Focus on the important metrics
      1. 13.5.1 Occupancy: Is there enough work?
      2. 13.5.2 Issue efficiency: Are your warps on break too often?
      3. 13.5.3 Achieved bandwidth: It always comes down to bandwidth
    6. 13.6 Containers and virtual machines provide alternate workflows
      1. 13.6.1 Docker containers as a workaround
      2. 13.6.2 Virtual machines using VirtualBox
    7. 13.7 Cloud options: A flexible and portable capability
    8. 13.8 Further explorations
      1. 13.8.1 Additional reading
      2. 13.8.2 Exercises
    9. Summary
  22. Part 4 High performance computing ecosystems
  23. 14 Affinity: Truce with the kernel
    1. 14.1 Why is affinity important?
    2. 14.2 Discovering your architecture
    3. 14.3 Thread affinity with OpenMP
    4. 14.4 Process affinity with MPI
      1. 14.4.1 Default process placement with OpenMPI
      2. 14.4.2 Taking control: Basic techniques for specifying process placement in OpenMPI
      3. 14.4.3 Affinity is more than just process binding: The full picture
    5. 14.5 Affinity for MPI plus OpenMP
    6. 14.6 Controlling affinity from the command line
      1. 14.6.1 Using hwloc-bind to assign affinity
      2. 14.6.2 Using likwid-pin: An affinity tool in the likwid tool suite
    7. 14.7 The future: Setting and changing affinity at run time
      1. 14.7.1 Setting affinities in your executable
      2. 14.7.2 Changing your process affinities during run time
    8. 14.8 Further explorations
      1. 14.8.1 Additional reading
      2. 14.8.2 Exercises
    9. Summary
  24. 15 Batch schedulers: Bringing order to chaos
    1. 15.1 The chaos of an unmanaged system
    2. 15.2 How not to be a nuisance when working on a busy cluster
      1. 15.2.1 Layout of a batch system for busy clusters
      2. 15.2.2 How to be courteous on busy clusters and HPC sites: Common HPC pet peeves
    3. 15.3 Submitting your first batch script
    4. 15.4 Automatic restarts for long-running jobs
    5. 15.5 Specifying dependencies in batch scripts
    6. 15.6 Further explorations
      1. 15.6.1 Additional reading
      2. 15.6.2 Exercises
    7. Summary
  25. 16 File operations for a parallel world
    1. 16.1 The components of a high-performance filesystem
    2. 16.2 Standard file operations: A parallel-to-serial interface
    3. 16.3 MPI file operations (MPI-IO) for a more parallel world
    4. 16.4 HDF5 is self-describing for better data management
    5. 16.5 Other parallel file software packages
    6. 16.6 Parallel filesystem: The hardware interface
      1. 16.6.1 Everything you wanted to know about your parallel file setup but didn’t know how to ask
      2. 16.6.2 General hints that apply to all filesystems
      3. 16.6.3 Hints specific to particular filesystems
    7. 16.7 Further explorations
      1. 16.7.1 Additional reading
      2. 16.7.2 Exercises
    8. Summary
  26. 17 Tools and resources for better code
    1. 17.1 Version control systems: It all begins here
      1. 17.1.1 Distributed version control fits the more mobile world
      2. 17.1.2 Centralized version control for simplicity and code security
    2. 17.2 Timer routines for tracking code performance
    3. 17.3 Profilers: You can’t improve what you don’t measure
      1. 17.3.1 Simple text-based profilers for everyday use
      2. 17.3.2 High-level profilers for quickly identifying bottlenecks
      3. 17.3.3 Medium-level profilers to guide your application development
      4. 17.3.4 Detailed profilers give the gory details of hardware performance
    4. 17.4 Benchmarks and mini-apps: A window into system performance
      1. 17.4.1 Benchmarks measure system performance characteristics
      2. 17.4.2 Mini-apps give the application perspective
    5. 17.5 Detecting (and fixing) memory errors for a robust application
      1. 17.5.1 Valgrind Memcheck: The open source standby
      2. 17.5.2 Dr. Memory for your memory ailments
      3. 17.5.3 Commercial memory tools for demanding applications
      4. 17.5.4 Compiler-based memory tools for convenience
      5. 17.5.5 Fence-post checkers detect out-of-bounds memory accesses
      6. 17.5.6 GPU memory tools for robust GPU applications
    6. 17.6 Thread checkers for detecting race conditions
      1. 17.6.1 Intel® Inspector: A race condition detection tool with a GUI
      2. 17.6.2 Archer: A text-based tool for detecting race conditions
    7. 17.7 Bug-busters: Debuggers to exterminate those bugs
      1. 17.7.1 TotalView debugger is widely available at HPC sites
      2. 17.7.2 DDT is another debugger widely available at HPC sites
      3. 17.7.3 Linux debuggers: Free alternatives for your local development needs
      4. 17.7.4 GPU debuggers can help crush those GPU bugs
    8. 17.8 Profiling those file operations
    9. 17.9 Package managers: Your personal system administrator
      1. 17.9.1 Package managers for macOS
      2. 17.9.2 Package managers for Windows
      3. 17.9.3 The Spack package manager: A package manager for high performance computing
    10. 17.10 Modules: Loading specialized toolchains
      1. 17.10.1 TCL modules: The original modules system for loading software toolchains
      2. 17.10.2 Lmod: A Lua-based alternative Modules implementation
    11. 17.11 Reflections and exercises
    12. Summary
  27. appendix A References
  28. appendix B Solutions to exercises
  29. appendix C Glossary
  30. index

Product information

  • Title: Parallel and High Performance Computing
  • Author(s): Yuliana Zamora, Robert Robey
  • Release date: July 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617296468