Book description
Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools.
Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, Elasticsearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.
- Build value from your data in a series of agile sprints, using the data-value pyramid
- Extract features for statistical models from a single dataset
- Visualize data with charts, and expose different aspects through interactive reports
- Use historical data to predict the future via classification and regression
- Translate predictions into actions
- Get feedback from users after each sprint to keep your project on track
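As a taste of the "predict the future via classification" step above, here is a minimal sketch using scikit-learn, one of the tools the book covers. The dataset is synthetic and purely illustrative, not from the book itself:

```python
# Sketch: train a classifier on "historical" data, then score it on
# held-out data that stands in for the future.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for features extracted from a dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit on the past...
model = LogisticRegression()
model.fit(X_train, y_train)

# ...predict the "future" (the held-out 25%).
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

The book extends this same loop with Spark MLlib for scale and Airflow for scheduling.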
Table of contents
- Preface
- I. Setup
- 1. Theory
- 2. Agile Tools
- Scalability = Simplicity
- Agile Data Science Data Processing
- Local Environment Setup
- EC2 Environment Setup
- Getting and Running the Code
- Touring the Toolset
- Agile Stack Requirements
- Python 3
- Serializing Events with JSON Lines and Parquet
- Collecting Data
- Data Processing with Spark
- Publishing Data with MongoDB
- Searching Data with Elasticsearch
- Distributed Streams with Apache Kafka
- Processing Streams with PySpark Streaming
- Machine Learning with scikit-learn and Spark MLlib
- Scheduling with Apache Airflow (Incubating)
- Reflecting on Our Workflow
- Lightweight Web Applications
- Presenting Our Data
- Conclusion
- 3. Data
- II. Climbing the Pyramid
- 4. Collecting and Displaying Records
- 5. Visualizing Data with Charts and Tables
- 6. Exploring Data with Reports
- 7. Making Predictions
- 8. Deploying Predictive Systems
- Deploying a scikit-learn Application as a Web Service
- Deploying Spark ML Applications in Batch with Airflow
- Gathering Training Data in Production
- Training, Storing, and Loading Spark ML Models
- Creating Prediction Requests in MongoDB
- Fetching Prediction Requests from MongoDB
- Making Predictions in a Batch with Spark ML
- Storing Predictions in MongoDB
- Displaying Batch Prediction Results in Our Web Application
- Automating Our Workflow with Apache Airflow (Incubating)
- Conclusion
- Deploying Spark ML via Spark Streaming
- Conclusion
- 9. Improving Predictions
- A. Manual Installation
- Index
Product information
- Title: Agile Data Science 2.0
- Author(s): Russell Jurney
- Release date: June 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491960110