Book description
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.
About the Technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.
About the Book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.
What's Inside
- Writing Spark applications in Java
- Spark application architecture
- Ingestion through files, databases, streaming, and Elasticsearch
- Querying distributed datasets with Spark SQL
About the Reader
This book does not assume previous experience with Spark, Scala, or Hadoop.
About the Author
Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years.
Quotes
This book reveals the tools and secrets you need to drive innovation in your company or community.
- Rob Thomas, IBM
An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing.
- Anupam Sengupta, GuardHat Inc.
This book will help spark a love affair with distributed processing.
- Conor Redmond, InComm Product Control
Currently the best book on the subject!
- Markus Breuer, Materna IPS
Table of contents
- Copyright
- brief contents
- contents
- front matter
- Part 1. The theory crippled by awesome examples
- 1. So, what is Spark, anyway?
- 2. Architecture and flow
- 3. The majestic role of the dataframe
- 4. Fundamentally lazy
- 5. Building a simple app for deployment
- 6. Deploying your simple app
- Part 2. Ingestion
- 7. Ingestion from files
- 8. Ingestion from databases
- 9. Advanced ingestion: finding data sources and building your own
- 9.1 What is a data source?
- 9.2 Benefits of a direct connection to a data source
- 9.3 Finding data sources at Spark Packages
- 9.4 Building your own data source
- 9.5 Behind the scenes: Building the data source itself
- 9.6 Using the register file and the advertiser class
- 9.7 Understanding the relationship between the data and schema
- 9.8 Building the schema from a JavaBean
- 9.9 Building the dataframe is magic with the utilities
- 9.10 The other classes
- Summary
- 10. Ingestion through structured streaming
- Part 3. Transforming your data
- 11. Working with SQL
- 12. Transforming your data
- 13. Transforming entire documents
- 14. Extending transformations with user-defined functions
- 15. Aggregating your data
- Part 4. Going further
- 16. Cache and checkpoint: Enhancing Spark’s performances
- 17. Exporting data and building full data pipelines
- 18. Exploring deployment constraints: Understanding the ecosystem
- 18.1 Managing resources with YARN, Mesos, and Kubernetes
- 18.2 Sharing files with Spark
- 18.2.1 Accessing the data contained in files
- 18.2.2 Sharing files through distributed filesystems
- 18.2.3 Accessing files on shared drives or file server
- 18.2.4 Using file-sharing services to distribute files
- 18.2.5 Other options for accessing files in Spark
- 18.2.6 Hybrid solution for sharing files with Spark
- 18.3 Making sure your Spark application is secure
- Summary
- Appendixes.
- Appendix A. Installing Eclipse
- Appendix B. Installing Maven
- Appendix C. Installing Git
- Appendix D. Downloading the code and getting started with Eclipse
- Appendix E. A history of enterprise data
- Appendix F. Getting help with relational databases
- Appendix G. Static functions ease your transformations
- G.1 Functions per category
- G.1.1 Popular functions
- G.1.2 Aggregate functions
- G.1.3 Arithmetical functions
- G.1.4 Array manipulation functions
- G.1.5 Binary operations
- G.1.6 Byte functions
- G.1.7 Comparison functions
- G.1.8 Compute function
- G.1.9 Conditional operations
- G.1.10 Conversion functions
- G.1.11 Data shape functions
- G.1.12 Date and time functions
- G.1.13 Digest functions
- G.1.14 Encoding functions
- G.1.15 Formatting functions
- G.1.16 JSON functions
- G.1.17 List functions
- G.1.18 Map functions
- G.1.19 Mathematical functions
- G.1.20 Navigation functions
- G.1.21 Parsing functions
- G.1.22 Partition functions
- G.1.23 Rounding functions
- G.1.24 Sorting functions
- G.1.25 Statistical functions
- G.1.26 Streaming functions
- G.1.27 String functions
- G.1.28 Technical functions
- G.1.29 Trigonometry functions
- G.1.30 UDF helpers
- G.1.31 Validation functions
- G.1.32 Deprecated functions
- G.2 Function appearance per version of Spark
- G.2.1 Functions in Spark v3.0.0
- G.2.2 Functions in Spark v2.4.0
- G.2.3 Functions in Spark v2.3.0
- G.2.4 Functions in Spark v2.2.0
- G.2.5 Functions in Spark v2.1.0
- G.2.6 Functions in Spark v2.0.0
- G.2.7 Functions in Spark v1.6.0
- G.2.8 Functions in Spark v1.5.0
- G.2.9 Functions in Spark v1.4.0
- G.2.10 Functions in Spark v1.3.0
- Appendix H. Maven quick cheat sheet
- Appendix I. Reference for transformations and actions
- Appendix J. Enough Scala
- Appendix K. Installing Spark in production and a few tips
- Appendix L. Reference for ingestion
- Appendix M. Reference for joins
- Appendix N. Installing Elasticsearch and sample data
- Appendix O. Generating streaming data
- Appendix P. Reference for streaming
- Appendix Q. Reference for exporting data
- Appendix R. Finding help when you’re stuck
- index
Product information
- Title: Spark in Action, Second Edition
- Author(s): Jean-Georges Perrin
- Release date: June 2020
- Publisher(s): Manning Publications
- ISBN: 9781617295522