Book description
Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.
With this book, you will:
- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
- Understand data partitioning for optimized queries
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data
- Learn how to use and apply feature engineering in ML algorithms
- Understand and use practical and pragmatic data design patterns
Table of contents
- Foreword
- Preface
- I. Fundamentals
- 1. Introduction to Spark and PySpark
- 2. Transformations in Action
- 3. Mapper Transformations
- 4. Reductions in Spark
- II. Working with Data
- 5. Partitioning Data
- 6. Graph Algorithms
- 7. Interacting with External Data Sources
  - Relational Databases
  - Reading Text Files
  - Reading and Writing CSV Files
  - Reading and Writing JSON Files
  - Reading from and Writing to Amazon S3
  - Reading and Writing Hadoop Files
  - Reading and Writing Parquet Files
  - Reading and Writing Avro Files
  - Reading from and Writing to MS SQL Server
  - Reading Image Files
  - Summary
- 8. Ranking Algorithms
- III. Data Design Patterns
- 9. Classic Data Design Patterns
- 10. Practical Data Design Patterns
- 11. Join Design Patterns
- 12. Feature Engineering in PySpark
- Index
- About the Author
Product information
- Title: Data Algorithms with Spark
- Author(s): Mahmoud Parsian
- Release date: April 2022
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492082385