Python Data Analysis - Third Edition

Book description

Understand data analysis pipelines using machine learning algorithms and techniques with this practical guide

Key Features

  • Prepare and clean your data to use it for exploratory analysis, data manipulation, and data wrangling
  • Discover supervised, unsupervised, probabilistic, and Bayesian machine learning methods
  • Get to grips with graph processing and sentiment analysis

Book Description

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you'll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines.

Starting with the essential statistical and data analysis fundamentals using Python, you'll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You'll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you'll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, principal component analysis (PCA), and clustering. In the concluding chapters, you'll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask.

By the end of this data analysis book, you'll be equipped with the skills you need to prepare data for analysis, create meaningful data visualizations, and forecast values from data.

What you will learn

  • Explore data science and its various process models
  • Perform data manipulation using NumPy and pandas for aggregating, cleaning, and handling missing values
  • Create interactive visualizations using Matplotlib, Seaborn, and Bokeh
  • Retrieve, process, and store data in a wide range of formats
  • Understand data preprocessing and feature engineering using pandas and scikit-learn
  • Perform time series analysis and signal processing using sunspot cycle data
  • Analyze textual data and image data to perform advanced analysis
  • Get up to speed with parallel computing using Dask

Who this book is for

This book is for data analysts, business analysts, statisticians, and data scientists looking to learn how to use Python for data analysis. Students and academic faculty will also find this book useful for learning and teaching Python data analysis using a hands-on approach. A basic understanding of math and working knowledge of the Python programming language will help you get started with this book.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Python Data Analysis Third Edition
  3. About Packt
    1. Why subscribe?
  4. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Reviews
  6. Section 1: Foundation for Data Analysis
  7. Getting Started with Python Libraries
    1. Understanding data analysis
    2. The standard process of data analysis
    3. The KDD process
    4. SEMMA
    5. CRISP-DM
    6. Comparing data analysis and data science
    7. The roles of data analysts and data scientists
    8. The skillsets of data analysts and data scientists
    9. Installing Python 3
    10. Python installation and setup on Windows
    11. Python installation and setup on Linux
    12. Python installation and setup on Mac OS X with a GUI installer
    13. Python installation and setup on Mac OS X with brew
    14. Software used in this book
    15. Using IPython as a shell
    16. Reading manual pages
    17. Where to find help and references to Python data analysis libraries
    18. Using JupyterLab
    19. Using Jupyter Notebooks
    20. Advanced features of Jupyter Notebooks
    21. Keyboard shortcuts
    22. Installing other kernels
    23. Running shell commands
    24. Extensions for Notebook
    25. Summary
  8. NumPy and pandas
    1. Technical requirements
    2. Understanding NumPy arrays
    3. Array features
    4. Selecting array elements
    5. NumPy array numerical data types
    6. dtype objects
    7. Data type character codes
    8. dtype constructors
    9. dtype attributes
    10. Manipulating array shapes
    11. The stacking of NumPy arrays
    12. Partitioning NumPy arrays
    13. Changing the data type of NumPy arrays
    14. Creating NumPy views and copies
    15. Slicing NumPy arrays
    16. Boolean and fancy indexing
    17. Broadcasting arrays
    18. Creating pandas DataFrames
    19. Understanding pandas Series
    20. Reading and querying the Quandl data
    21. Describing pandas DataFrames
    22. Grouping and joining pandas DataFrames
    23. Working with missing values
    24. Creating pivot tables
    25. Dealing with dates
    26. Summary
    27. References
  9. Statistics
    1. Technical requirements
    2. Understanding attributes and their types
    3. Types of attributes
    4. Discrete and continuous attributes
    5. Measuring central tendency
    6. Mean
    7. Mode
    8. Median
    9. Measuring dispersion
    10. Skewness and kurtosis
    11. Understanding relationships using covariance and correlation coefficients
    12. Pearson's correlation coefficient
    13. Spearman's rank correlation coefficient
    14. Kendall's rank correlation coefficient
    15. Central limit theorem
    16. Collecting samples
    17. Performing parametric tests
    18. Performing non-parametric tests
    19. Summary
  10. Linear Algebra
    1. Technical requirements
    2. Fitting to polynomials with NumPy
    3. Determinant
    4. Finding the rank of a matrix
    5. Matrix inverse using NumPy
    6. Solving linear equations using NumPy
    7. Decomposing a matrix using SVD
    8. Eigenvectors and Eigenvalues using NumPy
    9. Generating random numbers
    10. Binomial distribution
    11. Normal distribution
    12. Testing normality of data using SciPy
    13. Creating a masked array using the numpy.ma subpackage
    14. Summary
  11. Section 2: Exploratory Data Analysis and Data Cleaning
  12. Data Visualization
    1. Technical requirements
    2. Visualization using Matplotlib
    3. Accessories for charts
    4. Scatter plot
    5. Line plot
    6. Pie plot
    7. Bar plot
    8. Histogram plot
    9. Bubble plot
    10. pandas plotting
    11. Advanced visualization using the Seaborn package
    12. lm plots
    13. Bar plots
    14. Distribution plots
    15. Box plots
    16. KDE plots
    17. Violin plots
    18. Count plots
    19. Joint plots
    20. Heatmaps
    21. Pair plots
    22. Interactive visualization with Bokeh
    23. Plotting a simple graph
    24. Glyphs
    25. Layouts
    26. Nested layout using row and column layouts
    27. Multiple plots
    28. Interactions
    29. Hide click policy
    30. Mute click policy
    31. Annotations
    32. Hover tool
    33. Widgets
    34. Tab panel
    35. Slider
    36. Summary
  13. Retrieving, Processing, and Storing Data
    1. Technical requirements
    2. Reading and writing CSV files with NumPy
    3. Reading and writing CSV files with pandas
    4. Reading and writing data from Excel
    5. Reading and writing data from JSON
    6. Reading and writing data from HDF5
    7. Reading and writing data from HTML tables
    8. Reading and writing data from Parquet
    9. Reading and writing data from a pickle pandas object
    10. Lightweight access with sqlite3
    11. Reading and writing data from MySQL
    12. Inserting a whole DataFrame into the database
    13. Reading and writing data from MongoDB
    14. Reading and writing data from Cassandra
    15. Reading and writing data from Redis
    16. PonyORM
    17. Summary
  14. Cleaning Messy Data
    1. Technical requirements
    2. Exploring data
    3. Filtering data to weed out the noise
    4. Column-wise filtration
    5. Row-wise filtration
    6. Handling missing values
    7. Dropping missing values
    8. Filling in a missing value
    9. Handling outliers
    10. Feature encoding techniques
    11. One-hot encoding
    12. Label encoding
    13. Ordinal encoder
    14. Feature scaling
    15. Methods for feature scaling
    16. Feature transformation
    17. Feature splitting
    18. Summary
  15. Signal Processing and Time Series
    1. Technical requirements
    2. The statsmodels modules
    3. Moving averages
    4. Window functions
    5. Defining cointegration
    6. STL decomposition
    7. Autocorrelation
    8. Autoregressive models
    9. ARMA models
    10. Generating periodic signals
    11. Fourier analysis
    12. Spectral analysis filtering
    13. Summary
  16. Section 3: Deep Dive into Machine Learning
  17. Supervised Learning - Regression Analysis
    1. Technical requirements
    2. Linear regression
    3. Multiple linear regression
    4. Understanding multicollinearity
    5. Removing multicollinearity
    6. Dummy variables
    7. Developing a linear regression model
    8. Evaluating regression model performance
    9. R-squared
    10. MSE
    11. MAE
    12. RMSE
    13. Fitting polynomial regression
    14. Regression models for classification
    15. Logistic regression
    16. Characteristics of the logistic regression model
    17. Types of logistic regression algorithms
    18. Advantages and disadvantages of logistic regression
    19. Implementing logistic regression using scikit-learn
    20. Summary
  18. Supervised Learning - Classification Techniques
    1. Technical requirements
    2. Classification
    3. Naive Bayes classification
    4. Decision tree classification
    5. KNN classification
    6. SVM classification
    7. Terminology
    8. Splitting training and testing sets
    9. Holdout
    10. K-fold cross-validation
    11. Bootstrap method
    12. Evaluating the classification model performance
    13. Confusion matrix
    14. Accuracy
    15. Precision
    16. Recall
    17. F-measure
    18. ROC curve and AUC
    19. Summary
  19. Unsupervised Learning - PCA and Clustering
    1. Technical requirements
    2. Unsupervised learning
    3. Reducing the dimensionality of data
    4. PCA
    5. Performing PCA
    6. Clustering
    7. Finding the number of clusters
    8. The elbow method
    9. The silhouette method
    10. Partitioning data using k-means clustering
    11. Hierarchical clustering
    12. DBSCAN clustering
    13. Spectral clustering
    14. Evaluating clustering performance
    15. Internal performance evaluation
    16. The Davies-Bouldin index
    17. The silhouette coefficient
    18. External performance evaluation
    19. The Rand score
    20. The Jaccard score
    21. F-Measure or F1-score
    22. The Fowlkes-Mallows score
    23. Summary
  20. Section 4: NLP, Image Analytics, and Parallel Computing
  21. Analyzing Textual Data
    1. Technical requirements
    2. Installing NLTK and spaCy
    3. Text normalization
    4. Tokenization
    5. Removing stopwords
    6. Stemming and lemmatization
    7. POS tagging
    8. Recognizing entities
    9. Dependency parsing
    10. Creating a word cloud
    11. Bag of Words
    12. TF-IDF
    13. Sentiment analysis using text classification
    14. Classification using BoW
    15. Classification using TF-IDF
    16. Text similarity
    17. Jaccard similarity
    18. Cosine similarity
    19. Summary
  22. Analyzing Image Data
    1. Technical requirements
    2. Installing OpenCV
    3. Understanding image data
    4. Binary images
    5. Grayscale images
    6. Color images
    7. Color models
    8. Drawing on images
    9. Writing on images
    10. Resizing images
    11. Flipping images
    12. Changing the brightness
    13. Blurring an image
    14. Face detection
    15. Summary
  23. Parallel Computing Using Dask
    1. Parallel computing using Dask
    2. Dask data types
    3. Dask Arrays
    4. Dask DataFrames
    5. DataFrame Indexing
    6. Filter data
    7. Groupby
    8. Converting a pandas DataFrame into a Dask DataFrame
    9. Converting a Dask DataFrame into a pandas DataFrame
    10. Dask Bags
    11. Creating a Dask Bag using Python iterable items
    12. Creating a Dask Bag using a text file
    13. Storing a Dask Bag in a text file
    14. Storing a Dask Bag in a DataFrame
    15. Dask Delayed
    16. Preprocessing data at scale
    17. Feature scaling in Dask
    18. Feature encoding in Dask
    19. Machine learning at scale
    20. Parallel computing using scikit-learn
    21. Reimplementing ML algorithms for Dask
    22. Logistic regression
    23. Clustering
    24. Summary
  24. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Python Data Analysis - Third Edition
  • Author(s): Avinash Navlani, Armando Fandango, Ivan Idris
  • Release date: February 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781789955248