Book description
Simplify your ETL processes with these hands-on data hygiene tips, tricks, and best practices.
Key Features
- Focus on the basics of data wrangling
- Study various ways to extract the most out of your data in less time
- Boost your learning curve with bonus topics like random data generation and data integrity checks
Book Description
For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you the core ideas behind these processes and equips you with knowledge of the most popular tools and techniques in the domain.
The book starts with the absolute basics of Python, focusing mainly on data structures. It then delves into the fundamental tools of data wrangling, such as the NumPy and Pandas libraries. You'll learn why you should avoid the traditional data-cleaning approaches used in other languages and instead take advantage of Python's specialized pre-built routines. These tips and tricks also demonstrate how to use the same Python backend to extract and transform data from an array of sources, including the Internet, large database vaults, and Excel financial tables. To help you prepare for more challenging scenarios, you'll cover how to handle missing or wrong data and reformat it to meet the requirements of the downstream analytics tool. The book further helps you grasp concepts through real-world examples and datasets.
By the end of this book, you will be confident in using a diverse array of sources to extract, clean, transform, and format your data efficiently.
What you will learn
- Use and manipulate complex and simple data structures
- Harness the full potential of DataFrames and numpy.array at run time
- Perform web scraping with BeautifulSoup4 and html5lib
- Execute advanced string search and manipulation with RegEx
- Handle outliers and perform data imputation with Pandas
- Use descriptive statistics and plotting techniques
- Practice data wrangling and modeling using data generation techniques
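The regular-expression and imputation items above can be illustrated with a minimal sketch. It uses only the Python standard library (`re`, `statistics`) rather than Pandas, and the record layout, `PRICE_RE`, and `parse_price` are hypothetical names invented for this illustration, not examples taken from the book.

```python
import re
import statistics

# Hypothetical raw records; the "id=...; price=..." layout is an assumption.
raw_rows = [
    "id=1; price=$12.50",
    "id=2; price=N/A",
    "id=3; price=$14.00",
    "id=4; price=$11.75",
    "id=5; price=$13.25",
]

# Capture the numeric part of a dollar amount that follows "price=".
PRICE_RE = re.compile(r"price=\$(\d+(?:\.\d+)?)")


def parse_price(row):
    """Return the price as a float, or None when it is missing or malformed."""
    match = PRICE_RE.search(row)
    return float(match.group(1)) if match else None


# Step 1: extract the field with a regular expression.
prices = [parse_price(row) for row in raw_rows]

# Step 2: impute missing values with the median of the known values.
known = [p for p in prices if p is not None]
median_price = statistics.median(known)
imputed = [median_price if p is None else p for p in prices]

print(prices)
print(imputed)
```

Median imputation is only one of several possible strategies; it is chosen here because it is less sensitive to extreme values than the mean.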
Who this book is for
Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.
Table of contents
- Preface
- Chapter 1
- Introduction to Data Wrangling with Python
- Introduction
- Python for Data Wrangling
- Lists, Sets, Strings, Tuples, and Dictionaries
- Lists
- Exercise 1: Accessing the List Members
- Exercise 2: Generating a List
- Exercise 3: Iterating over a List and Checking Membership
- Exercise 4: Sorting a List
- Exercise 5: Generating a Random List
- Activity 1: Handling Lists
- Sets
- Introduction to Sets
- Union and Intersection of Sets
- Creating Null Sets
- Dictionary
- Exercise 6: Accessing and Setting Values in a Dictionary
- Exercise 7: Iterating Over a Dictionary
- Exercise 8: Revisiting the Unique Valued List Problem
- Exercise 9: Deleting Value from Dict
- Exercise 10: Dictionary Comprehension
- Tuples
- Creating a Tuple with Different Cardinalities
- Unpacking a Tuple
- Exercise 11: Handling Tuples
- Strings
- Exercise 12: Accessing Strings
- Exercise 13: String Slices
- String Functions
- Exercise 14: Split and Join
- Activity 2: Analyze a Multiline String and Generate the Unique Word Count
- Summary
- Chapter 2
- Advanced Data Structures and File Handling
- Introduction
- Advanced Data Structures
- Iterator
- Exercise 15: Introduction to the Iterator
- Stacks
- Exercise 16: Implementing a Stack in Python
- Exercise 17: Implementing a Stack Using User-Defined Methods
- Exercise 18: Lambda Expression
- Exercise 19: Lambda Expression for Sorting
- Exercise 20: Multi-Element Membership Checking
- Queue
- Exercise 21: Implementing a Queue in Python
- Activity 3: Permutation, Iterator, Lambda, List
- Basic File Operations in Python
- Summary
- Chapter 3
- Introduction to NumPy, Pandas, and Matplotlib
- Introduction
- NumPy Arrays
- NumPy Array and Features
- Exercise 26: Creating a NumPy Array (from a List)
- Exercise 27: Adding Two NumPy Arrays
- Exercise 28: Mathematical Operations on NumPy Arrays
- Exercise 29: Advanced Mathematical Operations on NumPy Arrays
- Exercise 30: Generating Arrays Using arange and linspace
- Exercise 31: Creating Multi-Dimensional Arrays
- Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array
- Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors
- Exercise 34: Reshaping, Ravel, Min, Max, and Sorting
- Exercise 35: Indexing and Slicing
- Conditional Subsetting
- Exercise 36: Array Operations (array-array, array-scalar, and universal functions)
- Stacking Arrays
- Pandas DataFrames
- Statistics and Visualization with NumPy and Pandas
- Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)
- Exercise 42: Introduction to Matplotlib Through a Scatter Plot
- Definition of Statistical Measures – Central Tendency and Spread
- Random Variables and Probability Distribution
- What Is a Probability Distribution?
- Discrete Distributions
- Continuous Distributions
- Data Wrangling in Statistics and Visualization
- Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame
- Random Number Generation Using NumPy
- Exercise 43: Generating Random Numbers from a Uniform Distribution
- Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot
- Exercise 45: Generating Random Numbers from Normal Distribution and Histograms
- Exercise 46: Calculation of Descriptive Statistics from a DataFrame
- Exercise 47: Built-in Plotting Utilities
- Activity 5: Generating Statistics from a CSV File
- Summary
- Chapter 4
- A Deep Dive into Data Wrangling with Python
- Introduction
- Subsetting, Filtering, and Grouping
- Exercise 48: Loading and Examining a Superstore's Sales Data from an Excel File
- Subsetting the DataFrame
- An Example Use Case: Determining Statistics on Sales and Profit
- Exercise 49: The unique Function
- Conditional Selection and Boolean Filtering
- Exercise 50: Setting and Resetting the Index
- Exercise 51: The GroupBy Method
- Detecting Outliers and Handling Missing Values
- Concatenating, Merging, and Joining
- Useful Methods of Pandas
- Summary
- Chapter 5
- Getting Comfortable with Different Kinds of Data Sources
- Introduction
- Reading Data from Different Text-Based (and Non-Text-Based) Sources
- Data Files Provided with This Chapter
- Libraries to Install for This Chapter
- Exercise 60: Reading Data from a CSV File Where Headers Are Missing
- Exercise 61: Reading from a CSV File where Delimiters are not Commas
- Exercise 62: Bypassing the Headers of a CSV File
- Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File
- Reading Only the First N Rows (Especially Useful for Large Files)
- Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks
- Setting the skip_blank_lines Option
- Read CSV from a Zip file
- Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
- Exercise 65: Reading a General Delimited Text File
- Reading HTML Tables Directly from a URL
- Exercise 66: Further Wrangling to Get the Desired Data
- Exercise 67: Reading from a JSON File
- Reading a Stata File
- Exercise 68: Reading Tabular Data from a PDF File
- Introduction to Beautiful Soup 4 and Web Page Parsing
- Structure of HTML
- Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup
- Exercise 70: DataFrames and BeautifulSoup
- Exercise 71: Exporting a DataFrame as an Excel File
- Exercise 72: Stacking URLs from a Document using bs4
- Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
- Summary
- Chapter 6
- Learning the Hidden Secrets of Data Wrangling
- Chapter 7
- Advanced Web Scraping and Data Gathering
- Introduction
- The Basics of Web Scraping and the Beautiful Soup Library
- Libraries in Python
- Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page
- Exercise 82: Checking the Status of the Web Request
- Checking the Encoding of the Web Page
- Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length
- Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object
- Extracting Text from a Section
- Extracting Important Historical Events that Happened on Today's Date
- Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text
- Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page
- Reading Data from XML
- Exercise 87: Creating an XML File and Reading XML Element Objects
- Exercise 88: Finding Various Elements of Data within a Tree (Element)
- Reading from a Local XML File into an ElementTree Object
- Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes
- Exercise 90: Using the text Method to Extract Meaningful Data
- Extracting and Printing the GDP/Per Capita Information Using a Loop
- Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them
- Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping
- Reading Data from an API
- Defining the Base URL (or API Endpoint)
- Exercise 93: Defining and Testing a Function to Pull Country Data from an API
- Using the Built-In JSON Library to Read and Examine Data
- Printing All the Data Elements
- Using a Function that Extracts a DataFrame Containing Key Information
- Exercise 94: Testing the Function by Building a Small Database of Countries' Information
- Fundamentals of Regular Expressions (RegEx)
- Regex in the Context of Web Scraping
- Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence
- Using the Compile Method to Create a Regex Program
- Exercise 96: Compiling Programs to Match Objects
- Exercise 97: Using Additional Parameters in Match to Check for Positional Matching
- Finding the Number of Words in a List That End with "ing"
- Exercise 98: The search Method in Regex
- Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
- Exercise 100: Examples of Single Character Pattern Matching with search
- Exercise 101: Examples of Pattern Matching at the Start or End of a String
- Exercise 102: Examples of Pattern Matching with Multiple Characters
- Exercise 103: Greedy versus Non-Greedy Matching
- Exercise 104: Controlling Repetitions to Match
- Exercise 105: Sets of Matching Characters
- Exercise 106: The use of OR in Regex using the OR Operator
- The findall Method
- Activity 9: Extracting the Top 100 eBooks from Gutenberg
- Activity 10: Building Your Own Movie Database by Reading an API
- Summary
- Chapter 8
- RDBMS and SQL
- Introduction
- Refresher of RDBMS and SQL
- Using an RDBMS (MySQL/PostgreSQL/SQLite)
- Exercise 107: Connecting to Database in SQLite
- Exercise 108: DDL and DML Commands in SQLite
- Reading Data from a Database in SQLite
- Exercise 109: Sorting Values that are Present in the Database
- Exercise 110: Altering the Structure of a Table and Updating the New Fields
- Exercise 111: Grouping Values in Tables
- Relation Mapping in Databases
- Adding Rows in the comments Table
- Joins
- Retrieving Specific Columns from a JOIN query
- Exercise 112: Deleting Rows
- Updating Specific Values in a Table
- Exercise 113: RDBMS and DataFrames
- Activity 11: Retrieving Data Correctly From Databases
- Summary
- Chapter 9
- Application of Data Wrangling in Real Life
- Introduction
- Applying Your Knowledge to a Real-life Data Wrangling Task
- Activity 12: Data Wrangling Task – Fixing UN Data
- Activity 13: Data Wrangling Task – Cleaning GDP Data
- Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
- Activity 15: Data Wrangling Task – Connecting the New Data to the Database
- An Extension to Data Wrangling
- Summary
- Appendix
- Solution of Activity 1: Handling Lists
- Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count
- Solution of Activity 3: Permutation, Iterator, Lambda, List
- Solution of Activity 4: Design Your Own CSV Parser
- Solution of Activity 5: Generating Statistics from a CSV File
- Solution of Activity 6: Working with the Adult Income Dataset (UCI)
- Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
- Solution of Activity 8: Handling Outliers and Missing Data
- Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg
- Solution of Activity 10: Building Your Own Movie Database by Reading an API
- Solution of Activity 11: Retrieving Data Correctly from Databases
- Solution of Activity 12: Data Wrangling Task – Fixing UN Data
- Solution of Activity 13: Data Wrangling Task – Cleaning GDP Data
- Solution of Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
- Solution of Activity 15: Data Wrangling Task – Connecting the New Data to a Database
Product information
- Title: Data Wrangling with Python
- Author(s):
- Release date: February 2019
- Publisher(s): Packt Publishing
- ISBN: 9781789800111