Book description
Over 140 practical recipes to help you make sense of your data with ease and build production-ready data apps
About This Book
- Analyze Big Data sets, create attractive visualizations, and manipulate and process various data types
- Packed with rich recipes to help you learn and explore amazing algorithms for statistics and machine learning
- Authored by Ivan Idris, an expert in Python programming and the author of eight highly reviewed books
Who This Book Is For
This book teaches Python data analysis at an intermediate level with the goal of transforming you from journeyman to master. Basic Python and data analysis skills, and an affinity for both, are assumed.
What You Will Learn
- Set up reproducible data analysis
- Clean and transform data
- Apply advanced statistical analysis
- Create attractive data visualizations
- Web scrape and work with databases, Hadoop, and Spark
- Analyze images and time series data
- Mine text and analyze social networks
- Use machine learning and evaluate the results
- Take advantage of parallelism and concurrency
In Detail
Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a wide range of tools and libraries, it has gradually become the primary language for data science, including data analysis, visualization, and machine learning.
Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. You will then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining.
In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. Finally, you will use parallelism, including multiple threads, to speed up your code and improve system performance.
By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.
Style and Approach
The book is written in "cookbook" style, striving for high realism in data analysis. Through the recipe-based format, you can read each recipe separately as required and immediately apply the knowledge gained.
Table of contents
Python Data Analysis Cookbook
- Table of Contents
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Preface
1. Laying the Foundation for Reproducible Data Analysis
- Introduction
- Setting up Anaconda
- Installing the Data Science Toolbox
- Creating a virtual environment with virtualenv and virtualenvwrapper
- Sandboxing Python applications with Docker images
- Keeping track of package versions and history in IPython Notebook
- Configuring IPython
- Learning to log for robust error checking
- Unit testing your code
- Configuring pandas
- Configuring matplotlib
- Seeding random number generators and NumPy print options
- Standardizing reports, code style, and data access
2. Creating Attractive Data Visualizations
- Introduction
- Graphing Anscombe's quartet
- Choosing seaborn color palettes
- Choosing matplotlib color maps
- Interacting with IPython Notebook widgets
- Viewing a matrix of scatterplots
- Visualizing with d3.js via mpld3
- Creating heatmaps
- Combining box plots and kernel density plots with violin plots
- Visualizing network graphs with hive plots
- Displaying geographical maps
- Using ggplot2-like plots
- Highlighting data points with influence plots
3. Statistical Data Analysis and Probability
- Introduction
- Fitting data to the exponential distribution
- Fitting aggregated data to the gamma distribution
- Fitting aggregated counts to the Poisson distribution
- Determining bias
- Estimating kernel density
- Determining confidence intervals for mean, variance, and standard deviation
- Sampling with probability weights
- Exploring extreme values
- Correlating variables with Pearson's correlation
- Correlating variables with the Spearman rank correlation
- Correlating a binary and a continuous variable with the point biserial correlation
- Evaluating relations between variables with ANOVA
4. Dealing with Data and Numerical Issues
- Introduction
- Clipping and filtering outliers
- Winsorizing data
- Measuring central tendency of noisy data
- Normalizing with the Box-Cox transformation
- Transforming data with the power ladder
- Transforming data with logarithms
- Rebinning data
- Applying logit() to transform proportions
- Fitting a robust linear model
- Taking variance into account with weighted least squares
- Using arbitrary precision for optimization
- Using arbitrary precision for linear algebra
5. Web Mining, Databases, and Big Data
- Introduction
- Simulating web browsing
- Scraping the Web
- Dealing with non-ASCII text and HTML entities
- Implementing association tables
- Setting up database migration scripts
- Adding a table column to an existing table
- Adding indices after table creation
- Setting up a test web server
- Implementing a star schema with fact and dimension tables
- Using HDFS
- Setting up Spark
- Clustering data with Spark
6. Signal Processing and Timeseries
- Introduction
- Spectral analysis with periodograms
- Estimating power spectral density with the Welch method
- Analyzing peaks
- Measuring phase synchronization
- Exponential smoothing
- Evaluating smoothing
- Using the Lomb-Scargle periodogram
- Analyzing the frequency spectrum of audio
- Analyzing signals with the discrete cosine transform
- Block bootstrapping time series data
- Moving block bootstrapping time series data
- Applying the discrete wavelet transform
7. Selecting Stocks with Financial Data Analysis
- Introduction
- Computing simple and log returns
- Ranking stocks with the Sharpe ratio and liquidity
- Ranking stocks with the Calmar and Sortino ratios
- Analyzing returns statistics
- Correlating individual stocks with the broader market
- Exploring risk and return
- Examining the market with the non-parametric runs test
- Testing for random walks
- Determining market efficiency with autoregressive models
- Creating tables for a stock prices database
- Populating the stock prices database
- Optimizing an equal weights two-asset portfolio
8. Text Mining and Social Network Analysis
- Introduction
- Creating a categorized corpus
- Tokenizing news articles in sentences and words
- Stemming, lemmatizing, filtering, and TF-IDF scores
- Recognizing named entities
- Extracting topics with non-negative matrix factorization
- Implementing a basic terms database
- Computing social network density
- Calculating social network closeness centrality
- Determining the betweenness centrality
- Estimating the average clustering coefficient
- Calculating the assortativity coefficient of a graph
- Getting the clique number of a graph
- Creating a document graph with cosine similarity
9. Ensemble Learning and Dimensionality Reduction
- Introduction
- Recursively eliminating features
- Applying principal component analysis for dimension reduction
- Applying linear discriminant analysis for dimension reduction
- Stacking and majority voting for multiple models
- Learning with random forests
- Fitting noisy data with the RANSAC algorithm
- Bagging to improve results
- Boosting for better learning
- Nesting cross-validation
- Reusing models with joblib
- Hierarchically clustering data
- Taking a Theano tour
10. Evaluating Classifiers, Regressors, and Clusters
- Introduction
- Getting classification straight with the confusion matrix
- Computing precision, recall, and F1-score
- Examining a receiver operating characteristic and the area under a curve
- Visualizing the goodness of fit
- Computing MSE and median absolute error
- Evaluating clusters with the mean silhouette coefficient
- Comparing results with a dummy classifier
- Determining MAPE and MPE
- Comparing with a dummy regressor
- Calculating the mean absolute error and the residual sum of squares
- Examining the kappa of classification
- Taking a look at the Matthews correlation coefficient
11. Analyzing Images
- Introduction
- Setting up OpenCV
- Applying Scale-Invariant Feature Transform (SIFT)
- Detecting features with SURF
- Quantizing colors
- Denoising images
- Extracting patches from an image
- Detecting faces with Haar cascades
- Searching for bright stars
- Extracting metadata from images
- Extracting texture features from images
- Applying hierarchical clustering on images
- Segmenting images with spectral clustering
12. Parallelism and Performance
- Introduction
- Just-in-time compiling with Numba
- Speeding up numerical expressions with Numexpr
- Running multiple threads with the threading module
- Launching multiple tasks with the concurrent.futures module
- Accessing resources asynchronously with the asyncio module
- Distributed processing with execnet
- Profiling memory usage
- Calculating the mean, variance, skewness, and kurtosis on the fly
- Caching with a least recently used cache
- Caching HTTP requests
- Streaming counting with the Count-min sketch
- Harnessing the power of the GPU with OpenCL
- A. Glossary
- B. Function Reference
- C. Online Resources
- D. Tips and Tricks for Command-Line and Miscellaneous Tools
- Index
Product information
- Title: Python Data Analysis Cookbook
- Author(s): Ivan Idris
- Release date: July 2016
- Publisher(s): Packt Publishing
- ISBN: 9781785282287