Book description
In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.
If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.
With this book, you will:
- Familiarize yourself with the Spark programming model
- Become comfortable within the Spark ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public data sets
- Discover which machine learning tools make sense for particular problems
- Acquire code that can be adapted to many uses
Table of contents
- Foreword
- Preface
- 1. Analyzing Big Data
- 2. Introduction to Data Analysis with Scala and Spark
- Scala for Data Scientists
- The Spark Programming Model
- Record Linkage
- Getting Started: The Spark Shell and SparkContext
- Bringing Data from the Cluster to the Client
- Shipping Code from the Client to the Cluster
- From RDDs to Data Frames
- Analyzing Data with the DataFrame API
- Fast Summary Statistics for DataFrames
- Pivoting and Reshaping DataFrames
- Joining DataFrames and Selecting Features
- Preparing Models for Production Environments
- Model Evaluation
- Where to Go from Here
- 3. Recommending Music and the Audioscrobbler Data Set
- 4. Predicting Forest Cover with Decision Trees
- 5. Anomaly Detection in Network Traffic with K-means Clustering
- 6. Understanding Wikipedia with Latent Semantic Analysis
- The Document-Term Matrix
- Getting the Data
- Parsing and Preparing the Data
- Lemmatization
- Computing the TF-IDFs
- Singular Value Decomposition
- Finding Important Concepts
- Querying and Scoring with a Low-Dimensional Representation
- Term-Term Relevance
- Document-Document Relevance
- Document-Term Relevance
- Multiple-Term Queries
- Where to Go from Here
- 7. Analyzing Co-Occurrence Networks with GraphX
- The MEDLINE Citation Index: A Network Analysis
- Getting the Data
- Parsing XML Documents with Scala's XML Library
- Analyzing the MeSH Major Topics and Their Co-Occurrences
- Constructing a Co-Occurrence Network with GraphX
- Understanding the Structure of Networks
- Filtering Out Noisy Edges
- Small-World Networks
- Where to Go from Here
- 8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
- 9. Estimating Financial Risk Through Monte Carlo Simulation
- 10. Analyzing Genomics Data and the BDG Project
- 11. Analyzing Neuroimaging Data with PySpark and Thunder
- Index
Product information
- Title: Advanced Analytics with Spark, 2nd Edition
- Author(s):
- Release date: June 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491972908