Book description
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Topics include:
- Statistical inference, exploratory data analysis, and the data science process
- Algorithms
- Spam filters, Naive Bayes, and data wrangling
- Logistic regression
- Financial modeling
- Recommendation engines and causality
- Data visualization
- Social networks and data journalism
- Data engineering, MapReduce, Pregel, and Hadoop
Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.
Table of contents
- Preface
- Motivation
- Origins of the Class
- Origins of the Book
- What to Expect from This Book
- How This Book Is Organized
- How to Read This Book
- How Code Is Used in This Book
- Who This Book Is For
- Prerequisites
- Supplemental Reading
- About the Contributors
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introduction: What Is Data Science?
- 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
- 3. Algorithms
- 4. Spam Filters, Naive Bayes, and Wrangling
- 5. Logistic Regression
- 6. Time Stamps and Financial Modeling
- Kyle Teague and GetGlue
- Timestamps
- Cathy O’Neil
- Thought Experiment
- Financial Modeling
- In-Sample, Out-of-Sample, and Causality
- Preparing Financial Data
- Log Returns
- Example: The S&P Index
- Working out a Volatility Measurement
- Exponential Downweighting
- The Financial Modeling Feedback Loop
- Why Regression?
- Adding Priors
- A Baby Model
- Exercise: GetGlue and Timestamped Event Data
- Exercise: Financial Data
- 7. Extracting Meaning from Data
- 8. Recommendation Engines: Building a User-Facing Data Product at Scale
- A Real-World Recommendation Engine
- Nearest Neighbor Algorithm Review
- Some Problems with Nearest Neighbors
- Beyond Nearest Neighbor: Machine Learning Classification
- The Dimensionality Problem
- Singular Value Decomposition (SVD)
- Important Properties of SVD
- Principal Component Analysis (PCA)
- Alternating Least Squares
- Fix V and Update U
- Last Thoughts on These Algorithms
- Thought Experiment: Filter Bubbles
- Exercise: Build Your Own Recommendation System
- 9. Data Visualization and Fraud Detection
- 10. Social Networks and Data Journalism
- 11. Causality
- 12. Epidemiology
- 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
- 14. Data Engineering: MapReduce, Pregel, and Hadoop
- 15. The Students Speak
- 16. Next-Generation Data Scientists, Hubris, and Ethics
- Index
Product information
- Title: Doing Data Science
- Author(s): Rachel Schutt, Cathy O’Neil
- Release date: October 2013
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449358655