Book description
A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline
Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
Unlike many analytics books, this one gives computer science and software engineering extensive coverage, since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:
• Extensive sample code and tutorials using Python™ along with its technical libraries
• Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems
• Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity
• A wide variety of case studies from industry
• Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed
The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.
FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.
Table of contents
- Cover
- Title Page
- Preface
- Part I: The Stuff You'll Always Use
- Chapter 2: The Data Science Road Map
- Chapter 3: Programming Languages
- Interlude: My Personal Toolkit
- Chapter 4: Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
- Chapter 5: Visualizations and Simple Metrics
- 5.1 A Note on Python's Visualization Tools
- 5.2 Example Code
- 5.3 Pie Charts
- 5.4 Bar Charts
- 5.5 Histograms
- 5.6 Means, Standard Deviations, Medians, and Quantiles
- 5.7 Boxplots
- 5.8 Scatterplots
- 5.9 Scatterplots with Logarithmic Axes
- 5.10 Scatter Matrices
- 5.11 Heatmaps
- 5.12 Correlations
- 5.13 Anscombe's Quartet and the Limits of Numbers
- 5.14 Time Series
- 5.15 Further Reading
- 5.16 Glossary
- Chapter 6: Machine Learning Overview
- Chapter 7: Interlude: Feature Extraction Ideas
- Chapter 8: Machine Learning Classification
- Chapter 9: Technical Communication and Documentation
- Part II: Stuff You Still Need to Know
- Chapter 10: Unsupervised Learning: Clustering and Dimensionality Reduction
- Chapter 11: Regression
- Chapter 12: Data Encodings and File Formats
- 12.1 Typical File Format Categories
- 12.2 CSV Files
- 12.3 JSON Files
- 12.4 XML Files
- 12.5 HTML Files
- 12.6 Tar Files
- 12.7 GZip Files
- 12.8 Zip Files
- 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
- 12.10 It's All Bytes at the End of the Day
- 12.11 Integers
- 12.12 Floats
- 12.13 Text Data
- 12.14 Further Reading
- 12.15 Glossary
- Chapter 13: Big Data
- 13.1 What Is Big Data?
- 13.2 Hadoop: The File System and the Processor
- 13.3 Using HDFS
- 13.4 Example PySpark Script
- 13.5 Spark Overview
- 13.6 Spark Operations
- 13.7 Two Ways to Run PySpark
- 13.8 Configuring Spark
- 13.9 Under the Hood
- 13.10 Spark Tips and Gotchas
- 13.11 The MapReduce Paradigm
- 13.12 Performance Considerations
- 13.13 Further Reading
- 13.14 Glossary
- Chapter 14: Databases
- Chapter 15: Software Engineering Best Practices
- Chapter 16: Natural Language Processing
- 16.1 Do I Even Need NLP?
- 16.2 The Great Divide: Language versus Statistics
- 16.3 Example: Sentiment Analysis on Stock Market Articles
- 16.4 Software and Datasets
- 16.5 Tokenization
- 16.6 Central Concept: Bag-of-Words
- 16.7 Word Weighting: TF-IDF
- 16.8 n-Grams
- 16.9 Stop Words
- 16.10 Lemmatization and Stemming
- 16.11 Synonyms
- 16.12 Part of Speech Tagging
- 16.13 Common Problems
- 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
- 16.15 Further Reading
- 16.16 Glossary
- Chapter 17: Time Series Analysis
- 17.1 Example: Predicting Wikipedia Page Views
- 17.2 A Typical Workflow
- 17.3 Time Series versus Time-Stamped Events
- 17.4 Resampling and Interpolation
- 17.5 Smoothing Signals
- 17.6 Logarithms and Other Transformations
- 17.7 Trends and Periodicity
- 17.8 Windowing
- 17.9 Brainstorming Simple Features
- 17.10 Better Features: Time Series as Vectors
- 17.11 Fourier Analysis: Sometimes a Magic Bullet
- 17.12 Time Series in Context: The Whole Suite of Features
- 17.13 Further Reading
- 17.14 Glossary
- Chapter 18: Probability
- 18.1 Flipping Coins: Bernoulli Random Variables
- 18.2 Throwing Darts: Uniform Random Variables
- 18.3 The Uniform Distribution and Pseudorandom Numbers
- 18.4 Nondiscrete, Noncontinuous Random Variables
- 18.5 Notation, Expectations, and Standard Deviation
- 18.6 Dependence, Marginal and Conditional Probability
- 18.7 Understanding the Tails
- 18.8 Binomial Distribution
- 18.9 Poisson Distribution
- 18.10 Normal Distribution
- 18.11 Multivariate Gaussian
- 18.12 Exponential Distribution
- 18.13 Log-Normal Distribution
- 18.14 Entropy
- 18.15 Further Reading
- 18.16 Glossary
- Chapter 19: Statistics
- 19.1 Statistics in Perspective
- 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
- 19.3 Hypothesis Testing: Key Idea and Example
- 19.4 Multiple Hypothesis Testing
- 19.5 Parameter Estimation
- 19.6 Hypothesis Testing: t-Test
- 19.7 Confidence Intervals
- 19.8 Bayesian Statistics
- 19.9 Naive Bayesian Statistics
- 19.10 Bayesian Networks
- 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
- 19.12 Further Reading
- 19.13 Glossary
- Chapter 20: Programming Language Concepts
- Chapter 21: Performance and Computer Memory
- 21.1 Example Script
- 21.2 Algorithm Performance and Big-O Notation
- 21.3 Some Classic Problems: Sorting a List and Binary Search
- 21.4 Amortized Performance and Average Performance
- 21.5 Two Principles: Reducing Overhead and Managing Memory
- 21.6 Performance Tip: Use Numerical Libraries When Applicable
- 21.7 Performance Tip: Delete Large Structures You Don't Need
- 21.8 Performance Tip: Use Built-In Functions When Possible
- 21.9 Performance Tip: Avoid Superfluous Function Calls
- 21.10 Performance Tip: Avoid Creating Large New Objects
- 21.11 Further Reading
- 21.12 Glossary
- Part III: Specialized or Advanced Topics
- Chapter 22: Computer Memory and Data Structures
- Chapter 23: Maximum Likelihood Estimation and Optimization
- Chapter 24: Advanced Classifiers
- 24.1 A Note on Libraries
- 24.2 Basic Deep Learning
- 24.3 Convolutional Neural Networks
- 24.4 Different Types of Layers. What the Heck Is a Tensor?
- 24.5 Example: The MNIST Handwriting Dataset
- 24.6 Recurrent Neural Networks
- 24.7 Bayesian Networks
- 24.8 Training and Prediction
- 24.9 Markov Chain Monte Carlo
- 24.10 PyMC Example
- 24.11 Further Reading
- 24.12 Glossary
- Chapter 25: Stochastic Modeling
- 25.1 Markov Chains
- 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
- 25.3 Markov Chain Monte Carlo
- 25.4 Hidden Markov Models and the Viterbi Algorithm
- 25.5 The Viterbi Algorithm
- 25.6 Random Walks
- 25.7 Brownian Motion
- 25.8 ARIMA Models
- 25.9 Continuous-Time Markov Processes
- 25.10 Poisson Processes
- 25.11 Further Reading
- 25.12 Glossary
- Parting Words: Your Future as a Data Scientist
- Index
- End User License Agreement
Product information
- Title: The Data Science Handbook
- Author(s): Field Cady
- Release date: February 2017
- Publisher(s): Wiley
- ISBN: 9781119092940