Hands-On Data Preprocessing in Python

Book description

Get your raw data cleaned up and ready for processing so that you can design better data analytic solutions

Key Features

  • Develop the skills to perform data cleaning, data integration, data reduction, and data transformation
  • Make the most of your raw data with powerful data transformation and massaging techniques
  • Perform thorough data cleaning, including dealing with missing values and outliers
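As a taste of what these techniques look like in practice, here is a minimal pandas sketch of dealing with a missing value and detecting an outlier. The toy data and the age column are hypothetical, and the 1.5 * IQR fence used here is just one common rule of thumb for outlier detection:

```python
import numpy as np
import pandas as pd

# Hypothetical toy attribute with one missing value and one outlier.
df = pd.DataFrame({"age": [25, 28, np.nan, 29, 31, 410]})

# Deal with the missing value: fill it with the attribute's median.
df["age"] = df["age"].fillna(df["age"].median())

# Detect outliers with the common 1.5 * IQR fence rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(df.loc[is_outlier, "age"].tolist())  # the 410 entry is flagged
```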

Book Description

Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who's developed college-level courses on data preprocessing and related subjects.

With this book, you'll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.

You'll learn about different technical and analytical aspects of data preprocessing (data collection, data cleaning, data integration, data reduction, and data transformation) and get to grips with implementing them using the open source Python programming environment.

The hands-on examples and easy-to-follow chapters will help you gain a comprehensive understanding of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision-making. As you progress through the chapters, you'll also understand the role of data management systems and technologies in effective analytics and how to use APIs to pull data.

By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.

What you will learn

  • Use Python to perform analytics functions on your data
  • Understand the role of databases and how to effectively pull data from databases
  • Perform data preprocessing steps defined by your analytics goals
  • Recognize and resolve data integration challenges
  • Identify the need for data reduction and execute it
  • Detect opportunities to improve analytics with data transformation
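For instance, two of the most common data transformation steps covered in the book, normalization and standardization, can be sketched in a few lines of pandas. The attribute names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical attributes measured on very different scales.
df = pd.DataFrame({"income": [30000, 52000, 71000, 98000],
                   "age": [22, 35, 41, 58]})

# Min-max normalization: rescale every attribute to the [0, 1] range.
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization: transform every attribute to z-scores
# (zero mean, unit standard deviation).
standardized = (df - df.mean()) / df.std()

print(normalized.round(2))
```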

Who this book is for

This book is for junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data. You don't need any prior experience with data preprocessing to get started. However, basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and some simple analytics experience, are prerequisites.

Table of contents

  1. Hands-On Data Preprocessing in Python
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  6. Part 1: Technical Needs
  7. Chapter 1: Review of the Core Modules of NumPy and Pandas
    1. Technical requirements
    2. Overview of the Jupyter Notebook
    3. Are we analyzing data via computer programming?
    4. Overview of the basic functions of NumPy
      1. The np.arange() function
      2. The np.zeros() and np.ones() functions
      3. The np.linspace() function
    5. Overview of Pandas
      1. Pandas data access
      2. Boolean masking for filtering a DataFrame
      3. Pandas functions for exploring a DataFrame
      4. Pandas applying a function
      5. The Pandas groupby function
      6. Pandas multi-level indexing
      7. Pandas pivot and melt functions
    6. Summary
    7. Exercises
  8. Chapter 2: Review of Another Core Module – Matplotlib
    1. Technical requirements
    2. Drawing the main plots in Matplotlib
      1. Summarizing numerical attributes using histograms or boxplots
      2. Observing trends in the data using a line plot
      3. Relating two numerical attributes using a scatterplot
    3. Modifying the visuals
      1. Adding a title to visuals and labels to the axis
      2. Adding legends
      3. Modifying ticks
      4. Modifying markers
    4. Subplots
    5. Resizing visuals and saving them
      1. Resizing
      2. Saving
    6. Example of Matplotlib assisting data preprocessing
    7. Summary
    8. Exercises
  9. Chapter 3: Data – What Is It Really?
    1. Technical requirements
    2. What is data?
      1. Why this definition?
      2. DIKW pyramid
      3. Data preprocessing for data analytics versus data preprocessing for machine learning
    3. The most universal data structure – a table
      1. Data objects
      2. Data attributes
    4. Types of data values
      1. Analytics standpoint
      2. Programming standpoint
    5. Information versus pattern
      1. Understanding everyday use of the word "information"
      2. Statistical use of the word "information"
      3. Statistical meaning of the word "pattern"
    6. Summary
    7. Exercises
    8. References
  10. Chapter 4: Databases
    1. Technical requirements
    2. What is a database?
      1. Understanding the difference between a database and a dataset
    3. Types of databases
      1. The differentiating elements of databases
      2. Relational databases (SQL databases)
      3. Unstructured databases (NoSQL databases)
      4. A practical example that requires a combination of both structured and unstructured databases
      5. Distributed databases
      6. Blockchain
    4. Connecting to, and pulling data from, databases
      1. Direct connection
      2. Web page connection
      3. API connection
      4. Request connection
      5. Publicly shared
    5. Summary
    6. Exercises
  11. Part 2: Analytic Goals
  12. Chapter 5: Data Visualization
    1. Technical requirements
    2. Summarizing a population
      1. Example of summarizing numerical attributes
      2. Example of summarizing categorical attributes
    3. Comparing populations
      1. Example of comparing populations using boxplots
      2. Example of comparing populations using histograms
      3. Example of comparing populations using bar charts
    4. Investigating the relationship between two attributes
      1. Visualizing the relationship between two numerical attributes
      2. Visualizing the relationship between two categorical attributes
      3. Visualizing the relationship between a numerical attribute and a categorical attribute
    5. Adding visual dimensions
      1. Example of a five-dimensional scatter plot
    6. Showing and comparing trends
      1. Example of visualizing and comparing trends
    7. Summary
    8. Exercise
  13. Chapter 6: Prediction
    1. Technical requirements
    2. Predictive models
      1. Forecasting
      2. Regression analysis
    3. Linear regression
      1. Example of applying linear regression to perform regression analysis
    4. MLP
      1. How does MLP work?
      2. Example of applying MLP to perform regression analysis
    5. Summary
    6. Exercises
  14. Chapter 7: Classification
    1. Technical requirements
    2. Classification models
      1. Example of designing a classification model
      2. Classification algorithms
    3. KNN
      1. Example of using KNN for classification
    4. Decision Trees
      1. Example of using Decision Trees for classification
    5. Summary
    6. Exercises
  15. Chapter 8: Clustering Analysis
    1. Technical requirements
    2. Clustering model
      1. Clustering example using a two-dimensional dataset
      2. Clustering example using a three-dimensional dataset
    3. K-Means algorithm
      1. Using K-Means to cluster a two-dimensional dataset
      2. Using K-Means to cluster a dataset with more than two dimensions
      3. Centroid analysis
    4. Summary
    5. Exercises
  16. Part 3: The Preprocessing
  17. Chapter 9: Data Cleaning Level I – Cleaning Up the Table
    1. Technical requirements
    2. The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11
      1. Purpose of data analytics
      2. Tools for data analytics
      3. Levels of data cleaning
      4. Mapping the purposes and tools of analytics to the levels of data cleaning
    3. Data cleaning level I – cleaning up the table
      1. Example 1 – unwise data collection
      2. Example 2 – reindexing (multi-level indexing)
      3. Example 3 – intuitive but long column titles
    4. Summary
    5. Exercises
  18. Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
    1. Technical requirements
    2. Example 1 – unpacking columns and reformulating the table
      1. Unpacking FileName
      2. Unpacking Content
      3. Reformulating a new table for visualization
      4. The last step – drawing the visualization
    3. Example 2 – restructuring the table
    4. Example 3 – level I and II data cleaning
      1. Level I cleaning
      2. Level II cleaning
      3. Doing the analytics – using linear regression to create a predictive model
    5. Summary
    6. Exercises
  19. Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
    1. Technical requirements
    2. Missing values
      1. Detecting missing values
      2. Example of detecting missing values
      3. Causes of missing values
      4. Types of missing values
      5. Diagnosis of missing values
      6. Dealing with missing values
    3. Outliers
      1. Detecting outliers
      2. Dealing with outliers
    4. Errors
      1. Types of errors
      2. Dealing with errors
      3. Detecting systematic errors
    5. Summary
    6. Exercises
  20. Chapter 12: Data Fusion and Data Integration
    1. Technical requirements
    2. What are data fusion and data integration?
      1. Data fusion versus data integration
      2. Directions of data integration
    3. Frequent challenges regarding data fusion and integration
      1. Challenge 1 – entity identification
      2. Challenge 2 – unwise data collection
      3. Challenge 3 – index mismatched formatting
      4. Challenge 4 – aggregation mismatch
      5. Challenge 5 – duplicate data objects
      6. Challenge 6 – data redundancy
    4. Example 1 (challenges 3 and 4)
    5. Example 2 (challenges 2 and 3)
    6. Example 3 (challenges 1, 3, 5, and 6)
      1. Checking for duplicate data objects
      2. Designing the structure for the result of data integration
      3. Filling songIntegrate_df from billboard_df
      4. Filling songIntegrate_df from songAttribute_df
      5. Filling songIntegrate_df from artist_df
      6. Checking for data redundancy
      7. The analysis
      8. Example summary
    7. Summary
    8. Exercise
  21. Chapter 13: Data Reduction
    1. Technical requirements
    2. The distinction between data reduction and data redundancy
      1. The objectives of data reduction
    3. Types of data reduction
    4. Performing numerosity data reduction
      1. Random sampling
      2. Stratified sampling
      3. Random over/undersampling
    5. Performing dimensionality data reduction
      1. Linear regression as a dimension reduction method
      2. Using a decision tree as a dimension reduction method
      3. Using random forest as a dimension reduction method
      4. Brute-force computational dimension reduction
      5. PCA
      6. Functional data analysis
    6. Summary
    7. Exercises
  22. Chapter 14: Data Transformation and Massaging
    1. Technical requirements
    2. The whys of data transformation and massaging
      1. Data transformation versus data massaging
    3. Normalization and standardization
    4. Binary coding, ranking transformation, and discretization
      1. Example one – binary coding of nominal attribute
      2. Example two – binary coding or ranking transformation of ordinal attributes
      3. Example three – discretization of numerical attributes
      4. Understanding the types of discretization
      5. Discretization – the number of cut-off points
      6. A summary – from numbers to categories and back
    5. Attribute construction
      1. Example – construct one transformed attribute from two attributes
    6. Feature extraction
      1. Example – extract three attributes from one attribute
      2. Example – Morphological feature extraction
      3. Feature extraction examples from the previous chapters
    7. Log transformation
      1. Implementation – doing it yourself
      2. Implementation – the working module doing it for you
    8. Smoothing, aggregation, and binning
      1. Smoothing
      2. Aggregation
      3. Binning
    9. Summary
    10. Exercise
  23. Part 4: Case Studies
  24. Chapter 15: Case Study 1 – Mental Health in Tech
    1. Technical requirements
    2. Introducing the case study
      1. The audience of the results of analytics
      2. Introduction to the source of the data
    3. Integrating the data sources
    4. Cleaning the data
      1. Detecting and dealing with outliers and errors
      2. Detecting and dealing with missing values
    5. Analyzing the data
      1. Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?
      2. Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?
      3. Analysis question three – do more supportive companies have mentally healthier employees?
      4. Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?
    6. Summary
  25. Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
    1. Technical requirements
    2. Introducing the case study
      1. Introducing the source of the data
    3. Preprocessing the data
      1. Designing the dataset to support the prediction
      2. Filling up the placeholder dataset
      3. Supervised dimension reduction
    4. Analyzing the data
    5. Summary
  26. Chapter 17: Case Study 3 – United States Counties Clustering Analysis
    1. Technical requirements
    2. Introducing the case study
      1. Introduction to the source of the data
    3. Preprocessing the data
      1. Transforming election_df to partisan_df
      2. Cleaning edu_df, employ_df, pop_df, and pov_df
      3. Data integration
      4. Data cleaning level III – missing values, errors, and outliers
      5. Checking for data redundancy
    4. Analyzing the data
      1. Using PCA to visualize the dataset
      2. K-Means clustering analysis
    5. Summary
  27. Chapter 18: Summary, Practice Case Studies, and Conclusions
    1. A summary of the book
      1. Part 1 – Technical requirements
      2. Part 2 – Analytics goals
      3. Part 3 – The preprocessing
      4. Part 4 – Case studies
    2. Practice case studies
    1. Google COVID-19 mobility dataset
      2. Police killings in the US
      3. US accidents
      4. San Francisco crime
      5. Data analytics job market
      6. FIFA 2018 player of the match
      7. Hot hands in basketball
      8. Wildfires in California
      9. Silicon Valley diversity profile
      10. Recognizing fake job posting
      11. Hunting more practice case studies
    3. Conclusions
    4. Why subscribe?
  28. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Hands-On Data Preprocessing in Python
  • Author(s): Roy Jafari
  • Release date: January 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781801072137