Book description
Understand data analysis pipelines using machine learning algorithms and techniques with this practical guide
Key Features
- Prepare and clean your data to use it for exploratory analysis, data manipulation, and data wrangling
- Discover supervised, unsupervised, probabilistic, and Bayesian machine learning methods
- Get to grips with graph processing and sentiment analysis
Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you'll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines.
Starting with the essential statistical and data analysis fundamentals using Python, you'll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You'll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you'll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you'll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask.
By the end of this data analysis book, you'll be equipped with the skills you need to prepare data for analysis, create meaningful data visualizations, and forecast values from data.
What you will learn
- Explore data science and its various process models
- Perform data manipulation using NumPy and pandas for aggregating, cleaning, and handling missing values
- Create interactive visualizations using Matplotlib, Seaborn, and Bokeh
- Retrieve, process, and store data in a wide range of formats
- Understand data preprocessing and feature engineering using pandas and scikit-learn
- Perform time series analysis and signal processing using sunspot cycle data
- Analyze textual data and image data to perform advanced analysis
- Get up to speed with parallel computing using Dask
Who this book is for
This book is for data analysts, business analysts, statisticians, and data scientists looking to learn how to use Python for data analysis. Students and faculty members will also find this book useful for learning and teaching Python data analysis using a hands-on approach. A basic understanding of math and a working knowledge of the Python programming language will help you get started with this book.
Table of contents
- Title Page
- Copyright and Credits
- About Packt
- Contributors
- Preface
- Section 1: Foundation for Data Analysis
- Getting Started with Python Libraries
- Understanding data analysis
- The standard process of data analysis
- The KDD process
- SEMMA
- CRISP-DM
- Comparing data analysis and data science
- The roles of data analysts and data scientists
- The skillsets of data analysts and data scientists
- Installing Python 3
- Python installation and setup on Windows
- Python installation and setup on Linux
- Python installation and setup on Mac OS X with a GUI installer
- Python installation and setup on Mac OS X with brew
- Software used in this book
- Using IPython as a shell
- Reading manual pages
- Where to find help and references to Python data analysis libraries
- Using JupyterLab
- Using Jupyter Notebooks
- Advanced features of Jupyter Notebooks
- Keyboard shortcuts
- Installing other kernels
- Running shell commands
- Extensions for Notebook
- Summary
- NumPy and pandas
- Technical requirements
- Understanding NumPy arrays
- Array features
- Selecting array elements
- NumPy array numerical data types
- dtype objects
- Data type character codes
- dtype constructors
- dtype attributes
- Manipulating array shapes
- The stacking of NumPy arrays
- Partitioning NumPy arrays
- Changing the data type of NumPy arrays
- Creating NumPy views and copies
- Slicing NumPy arrays
- Boolean and fancy indexing
- Broadcasting arrays
- Creating pandas DataFrames
- Understanding pandas Series
- Reading and querying the Quandl data
- Describing pandas DataFrames
- Grouping and joining pandas DataFrame
- Working with missing values
- Creating pivot tables
- Dealing with dates
- Summary
- References
- Statistics
- Technical requirements
- Understanding attributes and their types
- Types of attributes
- Discrete and continuous attributes
- Measuring central tendency
- Mean
- Mode
- Median
- Measuring dispersion
- Skewness and kurtosis
- Understanding relationships using covariance and correlation coefficients
- Pearson's correlation coefficient
- Spearman's rank correlation coefficient
- Kendall's rank correlation coefficient
- Central limit theorem
- Collecting samples
- Performing parametric tests
- Performing non-parametric tests
- Summary
- Linear Algebra
- Technical requirements
- Fitting to polynomials with NumPy
- Determinant
- Finding the rank of a matrix
- Matrix inverse using NumPy
- Solving linear equations using NumPy
- Decomposing a matrix using SVD
- Eigenvectors and Eigenvalues using NumPy
- Generating random numbers
- Binomial distribution
- Normal distribution
- Testing normality of data using SciPy
- Creating a masked array using the numpy.ma subpackage
- Summary
- Section 2: Exploratory Data Analysis and Data Cleaning
- Data Visualization
- Technical requirements
- Visualization using Matplotlib
- Accessories for charts
- Scatter plot
- Line plot
- Pie plot
- Bar plot
- Histogram plot
- Bubble plot
- pandas plotting
- Advanced visualization using the Seaborn package
- lm plots
- Bar plots
- Distribution plots
- Box plots
- KDE plots
- Violin plots
- Count plots
- Joint plots
- Heatmaps
- Pair plots
- Interactive visualization with Bokeh
- Plotting a simple graph
- Glyphs
- Layouts
- Nested layout using row and column layouts
- Multiple plots
- Interactions
- Hide click policy
- Mute click policy
- Annotations
- Hover tool
- Widgets
- Tab panel
- Slider
- Summary
- Retrieving, Processing, and Storing Data
- Technical requirements
- Reading and writing CSV files with NumPy
- Reading and writing CSV files with pandas
- Reading and writing data from Excel
- Reading and writing data from JSON
- Reading and writing data from HDF5
- Reading and writing data from HTML tables
- Reading and writing data from Parquet
- Reading and writing data from a pickle pandas object
- Lightweight access with sqlite3
- Reading and writing data from MySQL
- Inserting a whole DataFrame into the database
- Reading and writing data from MongoDB
- Reading and writing data from Cassandra
- Reading and writing data from Redis
- PonyORM
- Summary
- Cleaning Messy Data
- Technical requirements
- Exploring data
- Filtering data to weed out the noise
- Column-wise filtration
- Row-wise filtration
- Handling missing values
- Dropping missing values
- Filling in a missing value
- Handling outliers
- Feature encoding techniques
- One-hot encoding
- Label encoding
- Ordinal encoder
- Feature scaling
- Methods for feature scaling
- Feature transformation
- Feature splitting
- Summary
- Signal Processing and Time Series
- Section 3: Deep Dive into Machine Learning
- Supervised Learning - Regression Analysis
- Technical requirements
- Linear regression
- Multiple linear regression
- Understanding multicollinearity
- Removing multicollinearity
- Dummy variables
- Developing a linear regression model
- Evaluating regression model performance
- R-squared
- MSE
- MAE
- RMSE
- Fitting polynomial regression
- Regression models for classification
- Logistic regression
- Characteristics of the logistic regression model
- Types of logistic regression algorithms
- Advantages and disadvantages of logistic regression
- Implementing logistic regression using scikit-learn
- Summary
- Supervised Learning - Classification Techniques
- Technical requirements
- Classification
- Naive Bayes classification
- Decision tree classification
- KNN classification
- SVM classification
- Terminology
- Splitting training and testing sets
- Holdout
- K-fold cross-validation
- Bootstrap method
- Evaluating the classification model performance
- Confusion matrix
- Accuracy
- Precision
- Recall
- F-measure
- ROC curve and AUC
- Summary
- Unsupervised Learning - PCA and Clustering
- Technical requirements
- Unsupervised learning
- Reducing the dimensionality of data
- PCA
- Performing PCA
- Clustering
- Finding the number of clusters
- The elbow method
- The silhouette method
- Partitioning data using k-means clustering
- Hierarchical clustering
- DBSCAN clustering
- Spectral clustering
- Evaluating clustering performance
- Internal performance evaluation
- The Davies-Bouldin index
- The silhouette coefficient
- External performance evaluation
- The Rand score
- The Jaccard score
- F-Measure or F1-score
- The Fowlkes-Mallows score
- Summary
- Section 4: NLP, Image Analytics, and Parallel Computing
- Analyzing Textual Data
- Technical requirements
- Installing NLTK and SpaCy
- Text normalization
- Tokenization
- Removing stopwords
- Stemming and lemmatization
- POS tagging
- Recognizing entities
- Dependency parsing
- Creating a word cloud
- Bag of Words
- TF-IDF
- Sentiment analysis using text classification
- Classification using BoW
- Classification using TF-IDF
- Text similarity
- Jaccard similarity
- Cosine similarity
- Summary
- Analyzing Image Data
- Parallel Computing Using Dask
- Parallel computing using Dask
- Dask data types
- Dask Arrays
- Dask DataFrames
- DataFrame Indexing
- Filter data
- Groupby
- Converting a pandas DataFrame into a Dask DataFrame
- Converting a Dask DataFrame into a pandas DataFrame
- Dask Bags
- Creating a Dask Bag using Python iterable items
- Creating a Dask Bag using a text file
- Storing a Dask Bag in a text file
- Storing a Dask Bag in a DataFrame
- Dask Delayed
- Preprocessing data at scale
- Feature scaling in Dask
- Feature encoding in Dask
- Machine learning at scale
- Parallel computing using scikit-learn
- Reimplementing ML algorithms for Dask
- Logistic regression
- Clustering
- Summary
- Other Books You May Enjoy
Product information
- Title: Python Data Analysis - Third Edition
- Author(s):
- Release date: February 2021
- Publisher(s): Packt Publishing
- ISBN: 9781789955248