Book description
Over 90 insightful recipes to get lightning-fast analytics with Apache Spark
About This Book
- Use Apache Spark for data processing with these hands-on recipes
- Implement end-to-end, large-scale data analysis better than ever before
- Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data
Who This Book Is For
This book is for novice and intermediate level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.
What You Will Learn
- Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
- Solve real-world analytical problems with large data sets.
- Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
- Get hands-on experience with algorithms such as classification, regression, and recommendation on real datasets using the Spark MLlib package.
- Learn about numerical and scientific computing using NumPy and SciPy on Spark.
- Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.
In Detail
Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark's selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle with ease the complexities that come with raw, unstructured data sets.
This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to problematic concepts in data science using Spark's data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.
Style and approach
This book contains a comprehensive range of recipes designed to help you learn the fundamentals and tackle the difficulties of data science. It takes a recipe-based approach, outlining practical steps to produce powerful insights into Big Data.
Table of contents
- Apache Spark for Data Science Cookbook
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Preface
- 1. Big Data Analytics with Spark
- Introduction
- Initializing SparkContext
- Working with Spark's Python and Scala shells
- Building standalone applications
- Working with the Spark programming model
- Working with pair RDDs
- Persisting RDDs
- Loading and saving data
- Creating broadcast variables and accumulators
- Submitting applications to a cluster
- Working with DataFrames
- Working with Spark Streaming
- 2. Tricky Statistics with Spark
- Introduction
- Variable identification
- Sampling data
- Summary and descriptive statistics
- Generating frequency tables
- Installing Pandas on Linux
- Installing Pandas from source
- Using IPython with PySpark
- Creating Pandas DataFrames over Spark
- Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
- Implementing co-variance and correlation using Pandas
- Concatenating and merging operations over DataFrames
- Complex operations over DataFrames
- Sparkling Pandas
- 3. Data Analysis with Spark
- 4. Clustering, Classification, and Regression
- Introduction
- Supervised learning
- Unsupervised learning
- Applying regression analysis for sales data
- Variable identification
- Data exploration
- Feature engineering
- Applying linear regression
- Applying logistic regression on bank marketing data
- Variable identification
- Data exploration
- Feature engineering
- Applying logistic regression
- Real-time intrusion detection using streaming k-means
- Variable identification
- Simulating real-time data
- Applying streaming k-means
- 5. Working with Spark MLlib
- 6. NLP with Spark
- Introduction
- Installing NLTK on Linux
- Installing Anaconda on Linux
- Anaconda for cluster management
- POS tagging with PySpark on an Anaconda cluster
- NER with IPython over Spark
- Implementing OpenNLP - chunker over Spark
- Implementing OpenNLP - sentence detector over Spark
- Implementing Stanford NLP - lemmatization over Spark
- Implementing sentiment analysis using Stanford NLP over Spark
- 7. Working with Sparkling Water - H2O
- 8. Data Visualization with Spark
- Introduction
- Visualization using Zeppelin
- Installing Zeppelin
- Customizing Zeppelin's server and websocket port
- Visualizing data on HDFS - parameterizing inputs
- Running custom functions
- Adding external dependencies to Zeppelin
- Pointing to an external Spark Cluster
- Creating scatter plots with Bokeh-Scala
- Creating a time series MultiPlot with Bokeh-Scala
- Creating plots with the lightning visualization server
- Visualize machine learning models with Databricks notebook
- 9. Deep Learning on Spark
- 10. Working with SparkR
- Introduction
- Installing R
- Interactive analysis with the SparkR shell
- Creating a SparkR standalone application from RStudio
- Creating SparkR DataFrames
- SparkR DataFrame operations
- Applying user-defined functions in SparkR
- Running SQL queries from SparkR and caching DataFrames
- Machine learning with SparkR
Product information
- Title: Apache Spark for Data Science Cookbook
- Author(s):
- Release date: December 2016
- Publisher(s): Packt Publishing
- ISBN: 9781785880100