Apache Spark 2: Data Processing and Real-Time Analytics

Book description

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework

Key Features

  • Master the art of real-time big data processing and machine learning
  • Explore a wide range of use cases for analyzing large-scale data
  • Discover ways to optimize your work using the many features of Spark 2.x and Scala

Book Description

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and build your own data flow and machine learning programs on this platform.

You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, working with DataFrames and Datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques using MLlib and various external tools.
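
To give a flavor of the style of code you will write along the way, here is a minimal, self-contained Scala sketch of the DataFrame and Spark SQL workflow described above. The object name, the sample data, and the local[*] master are illustrative assumptions for local experimentation, not examples taken from the books:

    import org.apache.spark.sql.SparkSession

    object QuickTour {
      def main(args: Array[String]): Unit = {
        // SparkSession is the unified entry point introduced in Spark 2.x
        val spark = SparkSession.builder()
          .appName("QuickTour")
          .master("local[*]") // local mode for experimentation; use a cluster URL otherwise
          .getOrCreate()
        import spark.implicits._

        // Build a DataFrame from an in-memory sequence of tuples (illustrative data)
        val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

        // Register a temporary view and query it interactively with Spark SQL
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name, age FROM people WHERE age > 30").show()

        spark.stop()
      }
    }

The same data could equally be handled through the typed Dataset API or the underlying RDDs; the chapters listed below compare the three and explain when to prefer each.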

By the end of this carefully designed Learning Path, you will have all the knowledge you need to master Apache Spark and build your own big data processing and analytics pipeline quickly and without hassle.

This Learning Path includes content from the following Packt products:

  • Mastering Apache Spark 2.x by Romeo Kienzler
  • Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
  • Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

What you will learn

  • Get to grips with all the features of Apache Spark 2.x
  • Perform highly optimized real-time big data processing
  • Use ML and DL techniques with Spark MLlib and third-party tools
  • Analyze structured and unstructured data using Spark SQL and GraphX
  • Understand tuning, debugging, and monitoring of big data applications
  • Build scalable and fault-tolerant streaming applications (see the sketch after this list)
  • Develop scalable recommendation engines
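
The streaming items above are best seen in code. Below is a minimal structured streaming word count in Scala, offered as a hedged sketch: the socket source on localhost:9999 is an assumption for local testing (feed it with, for example, nc -lk 9999) and is not itself fault-tolerant; production pipelines would pair a replayable source such as Kafka with checkpointing, as the streaming chapters below discuss:

    import org.apache.spark.sql.SparkSession

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("StreamingWordCount")
          .master("local[*]") // assumption: local test run
          .getOrCreate()
        import spark.implicits._

        // Read an unbounded stream of text lines from a TCP socket (test source)
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9999")
          .load()

        // Split each line into words and maintain a running count per word
        val counts = lines.as[String]
          .flatMap(_.split("\\s+"))
          .groupBy("value") // the single column of a Dataset[String] is named "value"
          .count()

        // Print the complete counts table to the console after every micro-batch
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }

Typing lines into the socket makes updated counts appear after each micro-batch; swapping the console sink for a fault-tolerant, idempotent sink is exactly the kind of step the streaming chapters below explore.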

Who this book is for

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark to build a strong big data pipeline will also find it useful. To grasp the concepts explained here, you must know the fundamentals of Apache Spark and Scala.

Table of contents

  1. Title Page
  2. Copyright
    1. Apache Spark 2: Data Processing and Real-Time Analytics
  3. About Packt
    1. Why Subscribe?
    2. Packt.com
  4. Contributors
    1. About the Authors
    2. Packt Is Searching for Authors Like You
  5. Preface
    1. Who This Book Is For
    2. What This Book Covers
    3. To Get the Most out of This Book
      1. Download the Example Code Files
      2. Conventions Used
    4. Get in Touch
      1. Reviews
  6. A First Taste and What's New in Apache Spark V2
    1. Spark machine learning
    2. Spark Streaming
    3. Spark SQL
    4. Spark graph processing
    5. Extended ecosystem
    6. What's new in Apache Spark V2?
    7. Cluster design
    8. Cluster management
      1. Local
      2. Standalone
      3. Apache YARN
      4. Apache Mesos
    9. Cloud-based deployments
    10. Performance
      1. The cluster structure
      2. Hadoop Distributed File System
      3. Data locality
      4. Memory
      5. Coding
    11. Cloud
    12. Summary
  7. Apache Spark Streaming
    1. Overview
    2. Errors and recovery
      1. Checkpointing
    3. Streaming sources
      1. TCP stream
      2. File streams
      3. Flume
      4. Kafka
    4. Summary
  8. Structured Streaming
    1. The concept of continuous applications
      1. True unification - same code, same engine
    2. Windowing
      1. How streaming engines use windowing
      2. How Apache Spark improves windowing
    3. Increased performance with good old friends
    4. How transparent fault tolerance and the exactly-once delivery guarantee are achieved
      1. Replayable sources can replay streams from a given offset
      2. Idempotent sinks prevent data duplication
      3. State versioning guarantees consistent results after reruns
    5. Example - connection to an MQTT message broker
      1. Controlling continuous applications
      2. More on stream life cycle management
    6. Summary
  9. Apache Spark MLlib
    1. Architecture
      1. The development environment
    2. Classification with Naive Bayes
      1. Theory on Classification
      2. Naive Bayes in practice
    3. Clustering with K-Means
      1. Theory on Clustering
      2. K-Means in practice
    4. Artificial neural networks
      1. ANN in practice
    5. Summary
  10. Apache SparkML
    1. What does the new API look like?
    2. The concept of pipelines
      1. Transformers
        1. String indexer
        2. OneHotEncoder
        3. VectorAssembler
      2. Pipelines
      3. Estimators
        1. RandomForestClassifier
    3. Model evaluation
    4. CrossValidation and hyperparameter tuning
      1. CrossValidation
      2. Hyperparameter tuning
    5. Winning a Kaggle competition with Apache SparkML
      1. Data preparation
      2. Feature engineering
      3. Testing the feature engineering pipeline
      4. Training the machine learning model
      5. Model evaluation
      6. CrossValidation and hyperparameter tuning
      7. Using the evaluator to assess the quality of the cross-validated and tuned model
    6. Summary
  11. Apache SystemML
    1. Why do we need just another library?
      1. Why on Apache Spark?
      2. The history of Apache SystemML
    2. A cost-based optimizer for machine learning algorithms
      1. An example - alternating least squares
      2. Apache SystemML architecture
        1. Language parsing
        2. High-level operators are generated
        3. How low-level operators are optimized
    3. Performance measurements
    4. Apache SystemML in action
    5. Summary
  12. Apache Spark GraphX
    1. Overview
    2. Graph analytics/processing with GraphX
      1. The raw data
      2. Creating a graph
      3. Example 1 – counting
      4. Example 2 – filtering
      5. Example 3 – PageRank
      6. Example 4 – triangle counting
      7. Example 5 – connected components
    3. Summary
  13. Spark Tuning
    1. Monitoring Spark jobs
      1. Spark web interface
        1. Jobs
        2. Stages
        3. Storage
        4. Environment
        5. Executors
        6. SQL
      2. Visualizing Spark applications using the web UI
        1. Observing the running and completed Spark jobs
        2. Debugging Spark applications using logs
        3. Logging with log4j in Spark
    2. Spark configuration
      1. Spark properties
      2. Environmental variables
      3. Logging
    3. Common mistakes in Spark app development
      1. Application failure
        1. Slow jobs or unresponsiveness
    4. Optimization techniques
      1. Data serialization
      2. Memory tuning
        1. Memory usage and management
        2. Tuning the data structures
        3. Serialized RDD storage
        4. Garbage collection tuning
        5. Level of parallelism
        6. Broadcasting
        7. Data locality
    5. Summary
  14. Testing and Debugging Spark
    1. Testing in a distributed environment
      1. Distributed environment
        1. Issues in a distributed system
        2. Challenges of software testing in a distributed environment
    2. Testing Spark applications
      1. Testing Scala methods
      2. Unit testing
      3. Testing Spark applications
        1. Method 1: Using Scala JUnit test
        2. Method 2: Testing Scala code using FunSuite
        3. Method 3: Making life easier with Spark testing base
      4. Configuring Hadoop runtime on Windows
    3. Debugging Spark applications
      1. Logging with log4j in Spark: a recap
      2. Debugging the Spark application
        1. Debugging Spark applications in Eclipse with the Scala debugger
        2. Debugging Spark jobs running in local and standalone modes
        3. Debugging Spark applications on a YARN or Mesos cluster
        4. Debugging Spark applications using SBT
    4. Summary
  15. Practical Machine Learning with Spark Using Scala
    1. Introduction
      1. Apache Spark
      2. Machine learning
      3. Scala
      4. Software versions and libraries used in this book
    2. Configuring IntelliJ to work with Spark and run Spark ML sample codes
      1. Getting ready
      2. How to do it...
      3. There's more...
      4. See also
    3. Running a sample ML code from Spark
      1. Getting ready
      2. How to do it...
    4. Identifying data sources for practical machine learning
      1. Getting ready
      2. How to do it...
      3. See also
    5. Running your first program using Apache Spark 2.0 with the IntelliJ IDE
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. How to add graphics to your Spark program
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  16. Spark's Three Data Musketeers for Machine Learning - Perfect Together
    1. Introduction
      1. RDDs - what started it all...
      2. DataFrame - a natural evolution to unite API and SQL via a high-level API
      3. Dataset - a high-level unifying Data API
    2. Creating RDDs with Spark 2.0 using internal data sources
      1. How to do it...
      2. How it works...
    3. Creating RDDs with Spark 2.0 using external data sources
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Transforming RDDs with Spark 2.0 using the filter() API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Transforming RDDs with the super useful flatMap() API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Transforming RDDs with set operation APIs
      1. How to do it...
      2. How it works...
      3. See also
    7. RDD transformation/aggregation with groupBy() and reduceByKey()
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Transforming RDDs with the zip() API
      1. How to do it...
      2. How it works...
      3. See also
    9. Join transformation with paired key-value RDDs
      1. How to do it...
      2. How it works...
      3. There's more...
    10. Reduce and grouping transformation with paired key-value RDDs
      1. How to do it...
      2. How it works...
      3. See also
    11. Creating DataFrames from Scala data structures
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    12. Operating on DataFrames programmatically without SQL
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    13. Loading DataFrames and setup from an external source
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    14. Using DataFrames with standard SQL language - SparkSQL
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    15. Working with the Dataset API using a Scala Sequence
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    16. Creating and using Datasets from RDDs and back again
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    17. Working with JSON using the Dataset API and SQL together
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    18. Functional programming with the Dataset API using domain objects
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  17. Common Recipes for Implementing a Robust Machine Learning System
    1. Introduction
    2. Spark's basic statistical API to help you build your own algorithms
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. ML pipelines for real-life machine learning applications
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Normalizing data with Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Splitting data for training and testing
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Common operations with the new Dataset API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. LabeledPoint data structure for Spark ML
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    9. Getting access to Spark cluster in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Getting access to Spark cluster pre-Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    11. Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    12. New model export and PMML markup in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    13. Regression model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    14. Binary classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    15. Multiclass classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    16. Multilabel classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    17. Using the Scala Breeze library to do graphics in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  18. Recommendation Engine that Scales with Spark
    1. Introduction
      1. Content filtering
      2. Collaborative filtering
      3. Neighborhood method
      4. Latent factor model techniques
    2. Setting up the required data for a scalable recommendation engine in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Exploring the movies data details for the recommendation system in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Exploring the ratings data details for the recommendation system in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Building a scalable recommendation engine using collaborative filtering in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
        1. Dealing with implicit input for training
  19. Unsupervised Clustering with Apache Spark 2.0
    1. Introduction
    2. Building a KMeans classifying system in Spark 2.0
      1. How to do it...
      2. How it works...
        1. KMeans (Lloyd's algorithm)
        2. KMeans++ (Arthur's algorithm)
        3. KMeans|| (pronounced as KMeans Parallel)
      3. There's more...
      4. See also
    3. Bisecting KMeans, the new kid on the block in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
      1. How to do it...
      2. How it works...
        1. New GaussianMixture()
      3. There's more...
      4. See also
    5. Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Latent Dirichlet Allocation (LDA) to classify documents and text into topics
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Streaming KMeans to classify data in near real-time
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  20. Implementing Text Analytics with Spark 2.0 ML Library
    1. Introduction
    2. Doing term frequency with Spark - everything that counts
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Displaying similar words with Spark using Word2Vec
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Downloading a complete dump of Wikipedia for a real-life Spark ML project
      1. How to do it...
      2. There's more...
      3. See also
    5. Using Latent Semantic Analysis for text analytics with Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Topic modeling with Latent Dirichlet Allocation in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  21. Spark Streaming and Machine Learning Library
    1. Introduction
    2. Structured streaming for near real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Streaming DataFrames for real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Streaming Datasets for real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Streaming data and debugging with queueStream
      1. How to do it...
      2. How it works...
      3. See also
    6. Downloading and understanding the famous Iris data for unsupervised classification
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Streaming KMeans for a real-time on-line classifier
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Downloading wine quality data for streaming regression
      1. How to do it...
      2. How it works...
      3. There's more...
    9. Streaming linear regression for a real-time regression
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Downloading Pima Diabetes data for supervised classification
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    11. Streaming logistic regression for an on-line classifier
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  22. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Apache Spark 2: Data Processing and Real-Time Analytics
  • Author(s): Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi
  • Release date: December 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789959208