High Performance Spark, 2nd Edition

Book description

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Rachel Warren, and Anya Bida walk you through the secrets of the Spark code base, and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns.

Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 3.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey.

With this book, you'll learn how to:

  • Accelerate your ML workflows with integrations including PyTorch
  • Handle key skew and take advantage of Spark's new dynamic partitioning
  • Make your code reliable with scalable testing and validation techniques
  • Make Spark high performance
  • Deploy Spark on Kubernetes and similar environments
  • Take advantage of GPU acceleration with RAPIDS and resource profiles
  • Get your Spark jobs to run faster
  • Use Spark to productionize exploratory data science projects
  • Handle even larger datasets with Spark
  • Gain faster insights by reducing pipeline running times
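The skew-handling bullet above refers to Spark 3.x adaptive query execution (AQE), which can split oversized join partitions at runtime. As a minimal sketch of what enabling that looks like, the snippet below collects the relevant Spark SQL config keys; the `skew-demo` app name is a hypothetical placeholder, and the PySpark wiring is shown in comments so the snippet itself needs only the standard library:

```python
# Spark 3.x settings that let adaptive query execution (AQE) detect and
# split skewed join partitions at runtime.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",            # turn AQE on (default since Spark 3.2)
    "spark.sql.adaptive.skewJoin.enabled": "true",   # split oversized partitions in sort-merge joins
}

# With PySpark installed, the settings would be applied roughly like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("skew-demo")  # hypothetical app name
#   for key, value in aqe_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()

for key, value in aqe_conf.items():
    print(f"{key}={value}")
```

With these settings, a join against a handful of hot keys no longer funnels into one straggler task; AQE rewrites the plan mid-query based on observed partition sizes.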

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Introduction to High Performance Spark
    1. What Is Spark and Why Performance Matters
      1. What You Can Expect to Get from This Book
      2. Spark Versions
      3. Why the Focus on Scala and Python?
      4. The Spark Scala and Python APIs Are Easier to Use Than the Java API
      5. Why Not Scala?
      6. Learning Scala
    2. Conclusion
  3. 2. Upgrading Spark
    1. Finding What You Need to Change
      1. Compile Time Changes
      2. Exceptions at Runtime
      3. Differing Results
    2. Updating Your Code
      1. Scala
      2. Python
      3. SQL
    3. Verifying Correctness and Performance
    4. Conclusion
  4. 3. Going Beyond Scala
    1. Beyond Scala within the JVM
    2. Custom Code Beyond Scala, and Beyond the JVM
      1. How PySpark Works
      2. How SparkR Works
      3. Spark on the Common Language Runtime (CLR)—C# and Friends
    3. Calling Other Languages from Spark
      1. Using Pipe and Friends
      2. JNI
      3. Java Native Access (JNA)
      4. Project Panama
      5. Underneath Everything Is FORTRAN
      6. Getting to the GPU
    4. Going Beyond the JVM with Spark Accelerators
      1. Databricks Photon
      2. Apache Arrow Comet DataFusion
      3. Project Gluten
      4. Spark RAPIDS
      5. Application-Specific Integrated Circuits (ASICs)
    5. Managing Memory Outside of the JVM
    6. The Future (from ~2024)
    7. Conclusion
  5. 4. Testing, Validation, and Side-by-Side Runs
    1. Unit Testing
      1. Factoring Your Code for Testability
      2. Mocking RDDs
      3. Core Spark Jobs (Testing with RDDs)
      4. DStream Streaming
      5. Testing DataFrames
    2. Testing Codegen
    3. Getting Test Data
      1. Generating Large Datasets
      2. Sampling
    4. Property Checking with ScalaCheck
      1. Computing RDD Difference
    5. Integration Testing
      1. Choosing Your Integration Testing Environment
    6. Verifying Performance
      1. Spark Counters for Verifying Performance
      2. Projects for Verifying Performance
    7. Validation (or Audits)
      1. Data Validation
      2. Built-in Counters and Accumulators
    8. Side-by-Side Runs
    9. Conclusion
  6. About the Authors

Product information

  • Title: High Performance Spark, 2nd Edition
  • Author(s): Holden Karau, Adi Polak, Rachel Warren
  • Release date: August 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098145859