Book description
Revamp your data landscape and implement highly effective data pipelines in AWS with this hands-on guide. Purchase of the print or Kindle book includes a free PDF eBook.
Key Features
- Execute extract, transform, and load (ETL) tasks on data lakes, data warehouses, and databases
- Implement effective Pandas data operations with AWS Data Wrangler
- Integrate pipelines with AWS data services
Book Description
Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to reap the full potential of AWS data wrangling tools.
First, you’ll be introduced to data wrangling on AWS and the data wrangling services it offers. You’ll learn how to work with AWS Glue DataBrew, AWS Data Wrangler, and AWS SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you’ll explore advanced topics such as performing Pandas data operations with AWS Data Wrangler, optimizing ML data with AWS SageMaker, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects.
By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.
What you will learn
- Explore how to write simple to complex transformations using AWS Data Wrangler
- Use abstracted functions to extract and load data from and into AWS datastores
- Configure AWS Glue DataBrew for data wrangling
- Develop data pipelines using AWS Data Wrangler
- Integrate AWS security features into Data Wrangler using identity and access management (IAM)
- Optimize your data with AWS SageMaker
Who this book is for
This book is for data engineers, data scientists, and business data analysts looking to explore the capabilities, tools, and services of data wrangling on AWS for their ETL tasks. Basic knowledge of Python and Pandas, and familiarity with AWS tools such as AWS Glue and Amazon Athena, is required to get the most out of this book.
Table of contents
- Data Wrangling on AWS
- Contributors
- About the authors
- About the reviewer
- Preface
- Part 1: Unleashing Data Wrangling with AWS
- Chapter 1: Getting Started with Data Wrangling
- Part 2: Data Wrangling with AWS Tools
- Chapter 2: Introduction to AWS Glue DataBrew
- Why AWS Glue DataBrew?
- Getting started with AWS Glue DataBrew
- Using AWS Glue DataBrew for data wrangling
- Identifying the dataset
- Downloading the sample dataset
- Data discovery – creating an AWS Glue DataBrew profile for a dataset
- Data cleaning and enrichment – AWS Glue DataBrew transforms
- Data validation – performing data quality checks using AWS Glue DataBrew
- Data publication – fixing data quality issues
- Event-driven data quality check using Glue DataBrew
- Data protection with AWS Glue DataBrew
- Data lineage and data publication
- Summary
- Chapter 3: Introducing AWS SDK for pandas
- Chapter 4: Introduction to SageMaker Data Wrangler
- Part 3: AWS Data Management and Analysis
- Chapter 5: Working with Amazon S3
- What is big data?
- 5 Vs of big data
- What is a data lake?
- Data lake layouts
- Organizing and structuring data within an Amazon S3 data lake
- Process of building a data lake on Amazon S3
- Selecting the right file format for a data lake
- Selecting the right compression method for a data lake
- Choosing the right partitioning strategy for a data lake
- Configuring Amazon S3 Lifecycle for a data lake
- Optimizing the number of files and the size of each file
- Challenges and considerations when building a data lake on Amazon S3
- Summary
- Chapter 6: Working with AWS Glue
- Chapter 7: Working with Athena
- Chapter 8: Working with QuickSight
- Part 4: Advanced Data Manipulation and ML Data Optimization
- Chapter 9: Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas
- Chapter 10: Data Processing for Machine Learning with SageMaker Data Wrangler
- Part 5: Ensuring Data Lake Security and Monitoring
- Chapter 11: Data Lake Security and Monitoring
- Index
- Other Books You May Enjoy
Product information
- Title: Data Wrangling on AWS
- Author(s):
- Release date: July 2023
- Publisher(s): Packt Publishing
- ISBN: 9781801810906