Book description
Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure (see the sketch after this list)
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages
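As a rough illustration of the first point above (a minimal sketch, not an example from the book), the snippet below contrasts a hand-rolled RDD aggregation with the equivalent DataFrame query that Spark SQL's Catalyst optimizer can plan. The object name RddVsDatasetSketch, the toy (zip, happiness) data, and the column names are invented for illustration.

```scala
// Minimal sketch (assumed names/data): RDD aggregation vs. the equivalent
// DataFrame query that Spark SQL can optimize.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: (zip code, happiness score).
    val scores = Seq(("94110", 3.0), ("94110", 5.0), ("10001", 4.0))

    // RDD version: tuples are opaque to Spark, so no query optimization applies.
    val rddAvg = spark.sparkContext
      .parallelize(scores)
      .mapValues(h => (h, 1))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }

    // DataFrame version: a declarative query the Catalyst optimizer can plan,
    // avoiding per-record object overhead.
    val dfAvg = scores.toDF("zip", "happiness")
      .groupBy("zip")
      .agg(avg("happiness"))

    rddAvg.collect().foreach(println)
    dfAvg.show()

    spark.stop()
  }
}
```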
Table of contents
- Preface
- 1. Introduction to High Performance Spark
- 2. How Spark Works
- 3. DataFrames, Datasets, and Spark SQL
- Getting Started with the SparkSession (or HiveContext or SQLContext)
- Spark SQL Dependencies
- Basics of Schemas
- DataFrame API
- Data Representation in DataFrames and Datasets
- Data Loading and Saving Functions
- Datasets
- Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- Query Optimizer
- Debugging Spark SQL Queries
- JDBC/ODBC Server
- Conclusion
- 4. Joins (SQL and Core)
- 5. Effective Transformations
- 6. Working with Key/Value Data
- The Goldilocks Example
- Actions on Key/Value Pairs
- What’s So Dangerous About the groupByKey Function
- Choosing an Aggregation Operation
- Multiple RDD Operations
- Partitioners and Key/Value Data
- Dictionary of OrderedRDDOperations
- Secondary Sort and repartitionAndSortWithinPartitions
- Straggler Detection and Unbalanced Data
- Conclusion
- 7. Going Beyond Scala
- 8. Testing and Validation
- 9. Spark MLlib and ML
- Choosing Between Spark MLlib and Spark ML
- Working with MLlib
- Working with Spark ML
- Spark ML Organization and Imports
- Pipeline Stages
- Explain Params
- Data Encoding
- Data Cleaning
- Spark ML Models
- Putting It All Together in a Pipeline
- Training a Pipeline
- Accessing Individual Stages
- Data Persistence and Spark ML
- Extending Spark ML Pipelines with Your Own Algorithms
- Model and Pipeline Persistence and Serving with Spark ML
- General Serving Considerations
- Conclusion
- 10. Spark Components and Packages
- A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
- Index
Product information
- Title: High Performance Spark
- Author(s): Holden Karau, Rachel Warren
- Release date: May 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491943151