Data Wrangling on AWS

Book description

Revamp your data landscape and implement highly effective data pipelines in AWS with this hands-on guide. Purchase of the print or Kindle book includes a free PDF eBook.

Key Features

  • Execute extract, transform, and load (ETL) tasks on data lakes, data warehouses, and databases
  • Implement effective Pandas data operations with AWS Data Wrangler
  • Integrate pipelines with AWS data services

Book Description

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to realize the full potential of AWS data wrangling tools.

First, you’ll be introduced to data wrangling on AWS and to the data wrangling services available in AWS. You’ll learn how to work with AWS Glue DataBrew, AWS Data Wrangler, and Amazon SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you’ll explore advanced topics such as performing Pandas data operations with AWS Data Wrangler, optimizing ML data with Amazon SageMaker, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects.

By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.

What you will learn

  • Explore how to write simple-to-complex transformations using AWS Data Wrangler
  • Use abstracted functions to extract and load data from and into AWS datastores
  • Configure AWS Glue DataBrew for data wrangling
  • Develop data pipelines using AWS Data Wrangler
  • Integrate AWS security features into Data Wrangler using identity and access management (IAM)
  • Optimize your data with Amazon SageMaker

Who this book is for

This book is for data engineers, data scientists, and business data analysts looking to explore the capabilities, tools, and services of data wrangling on AWS for their ETL tasks. Basic knowledge of Python and Pandas, and familiarity with AWS tools such as AWS Glue and Amazon Athena, are required to get the most out of this book.

Table of contents

  1. Data Wrangling on AWS
  2. Contributors
  3. About the authors
  4. About the reviewer
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Unleashing Data Wrangling with AWS
  7. Chapter 1: Getting Started with Data Wrangling
    1. Introducing data wrangling
      1. The 80-20 rule of data analysis
      2. Advantages of data wrangling
    2. The steps involved in data wrangling
      1. Data discovery
      2. Data structuring
      3. Data cleaning
      4. Data enrichment
      5. Data validation
      6. Data publishing
    3. Best practices for data wrangling
      1. Identifying the business use case
      2. Identifying the data source and bringing the right data
      3. Identifying your audience
    4. Options available for data wrangling on AWS
      1. AWS Glue DataBrew
      2. SageMaker Data Wrangler
      3. AWS SDK for pandas
    5. Summary
  8. Part 2: Data Wrangling with AWS Tools
  9. Chapter 2: Introduction to AWS Glue DataBrew
    1. Why AWS Glue DataBrew?
      1. AWS Glue DataBrew’s basic building blocks
    2. Getting started with AWS Glue DataBrew
      1. Understanding the pricing of AWS Glue DataBrew
    3. Using AWS Glue DataBrew for data wrangling
      1. Identifying the dataset
      2. Downloading the sample dataset
      3. Data discovery – creating an AWS Glue DataBrew profile for a dataset
      4. Data cleaning and enrichment – AWS Glue DataBrew transforms
      5. Data validation – performing data quality checks using AWS Glue DataBrew
      6. Data publication – fixing data quality issues
      7. Event-driven data quality check using Glue DataBrew
    4. Data protection with AWS Glue DataBrew
      1. Encryption at rest
      2. Encryption in transit
      3. Identifying and handling PII
    5. Data lineage and data publication
    6. Summary
  10. Chapter 3: Introducing AWS SDK for pandas
    1. AWS SDK for pandas
    2. Building blocks of AWS SDK for pandas
      1. Arrow
      2. pandas
      3. Boto3
    3. Customizing, building, and installing AWS SDK for pandas for different use cases
      1. Standard and custom installation on your local machine or Amazon EC2
      2. Standard and custom installation with Lambda functions
      3. Standard and custom installation for AWS Glue jobs
      4. Standard and custom installation on Amazon SageMaker notebooks
    4. Configuration options for AWS SDK for pandas
      1. Setting up global variables
      2. Common use cases for configuring
    5. The features of AWS SDK for pandas with different AWS services
      1. Amazon S3
      2. Amazon Athena
      3. RDS databases
      4. Redshift
    6. Summary
  11. Chapter 4: Introduction to SageMaker Data Wrangler
    1. Data import
    2. Data orchestration
    3. Data transformation
    4. Insights and data quality
    5. Data analysis
    6. Data export
    7. SageMaker Studio setup prerequisites
      1. Prerequisites
      2. Studio domain
      3. Studio onboarding steps
    8. Summary
  12. Part 3: AWS Data Management and Analysis
  13. Chapter 5: Working with Amazon S3
    1. What is big data?
    2. 5 Vs of big data
    3. What is a data lake?
      1. Building a data lake on Amazon S3
      2. Advantages of building a data lake on Amazon S3
      3. Design principles to design a data lake on Amazon S3
    4. Data lake layouts
      1. Organizing and structuring data within an Amazon S3 data lake
      2. Process of building a data lake on Amazon S3
      3. Selecting the right file format for a data lake
      4. Selecting the right compression method for a data lake
      5. Choosing the right partitioning strategy for a data lake
      6. Configuring Amazon S3 Lifecycle for a data lake
      7. Optimizing the number of files and the size of each file
    5. Challenges and considerations when building a data lake on Amazon S3
    6. Summary
  14. Chapter 6: Working with AWS Glue
    1. What is Apache Spark?
      1. Apache Spark architecture
      2. Apache Spark framework
      3. Resilient Distributed Datasets
      4. Datasets and DataFrames
    2. Data discovery with AWS Glue
      1. AWS Glue Data Catalog
      2. Glue Connections
      3. AWS Glue crawlers
      4. Table stats
    3. Data ingestion using AWS Glue ETL
      1. AWS GlueContext
      2. DynamicFrame
      3. AWS Glue Job bookmarks
      4. AWS Glue Triggers
      5. AWS Glue interactive sessions
      6. AWS Glue Studio
      7. Ingesting data from object stores
    4. Summary
  15. Chapter 7: Working with Athena
    1. Understanding Amazon Athena
      1. When to use SQL/Spark analysis options?
    2. Advanced data discovery and data structuring with Athena
      1. SQL-based data discovery with Athena
      2. Using CTAS for data structuring
    3. Enriching data from multiple sources using Athena
      1. Enriching data using Athena SQL joins
      2. Setting up data federation for source databases
      3. Enriching data with data federation
    4. Setting up a serverless data quality pipeline with Athena
      1. Implementing data quality rules in Athena
      2. Amazon DynamoDB as a metadata store for data quality pipelines
      3. Serverless data quality pipeline
      4. Automating the data quality pipeline
    5. Summary
  16. Chapter 8: Working with QuickSight
    1. Introducing Amazon QuickSight and its concepts
    2. Data discovery with QuickSight
      1. QuickSight-supported data sources and setup
      2. Data discovery with QuickSight analysis
      3. QuickSight Q and AI-based data analysis/discovery
    3. Data visualization with QuickSight
      1. Visualization and charts with QuickSight
      2. Embedded analytics
    4. Summary
  17. Part 4: Advanced Data Manipulation and ML Data Optimization
  18. Chapter 9: Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas
    1. A solution walkthrough for sportstickets.com
      1. Prerequisites for data ingestion
      2. When would you use them?
      3. Loading sample data into a source database
    2. Data discovery
      1. Exploring data using S3 Select commands
      2. Access through Amazon Athena and the Glue Catalog
    3. Data structuring
      1. Different file formats and when to use them
      2. Restructuring data using Pandas
      3. Flattening nested data with Pandas
    4. Data cleaning
      1. Data cleansing with Pandas
    5. Data enrichment
      1. Pandas operations for data transformation
    6. Data quality validation
      1. Data quality validation with Pandas
      2. Data quality validation integration with a data pipeline
    7. Data visualization
      1. Visualization with Python libraries
    8. Summary
  19. Chapter 10: Data Processing for Machine Learning with SageMaker Data Wrangler
    1. Technical requirements
    2. Step 1 – logging in to SageMaker Studio
    3. Step 2 – importing data
    4. Exploratory data analysis
      1. Built-in data insights
      2. Step 3 – creating data analysis
    5. Step 4 – adding transformations
      1. Categorical encoding
      2. Custom transformation
      3. Numeric scaling
      4. Dropping columns
    6. Step 5 – exporting data
    7. Training a machine learning model
    8. Summary
  20. Part 5: Ensuring Data Lake Security and Monitoring
  21. Chapter 11: Data Lake Security and Monitoring
    1. Data lake security
      1. Data lake access control
      2. Additional options to control data lake access
      3. AWS Lake Formation integration
      4. Data protection
      5. Securing your data in AWS Glue
    2. Monitoring and auditing
      1. Amazon CloudWatch
      2. Monitoring an AWS Glue job using AWS Glue ETL job monitoring
      3. Amazon CloudTrail
    3. Summary
  22. Index
    1. Why subscribe?
  23. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Data Wrangling on AWS
  • Author(s): Navnit Shukla, Sankar M, Sampat Palani
  • Release date: July 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781801810906