Book description
Build efficient data lakes that can scale to virtually unlimited size using AWS Glue
Organizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes.
Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, as well as getting to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options.
By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.
What you will learn
- Apply various AWS Glue features to manage and create data lakes
- Use Glue DataBrew and Glue Studio for data preparation
- Optimize data layout in cloud storage to accelerate analytics workloads
- Manage metadata including database, table, and schema definitions
- Secure your data through access control, encryption, auditing, and networking
- Monitor AWS Glue jobs to detect delays and loss of data
- Integrate Spark ML and SageMaker with AWS Glue to create machine learning models
Who this book is for
ETL developers, data engineers, and data analysts
Table of contents
- Serverless ETL and Analytics with AWS Glue
- Contributors
- About the authors
- About the reviewers
- Preface
- Section 1 – Introduction, Concepts, and the Basics of AWS Glue
- Chapter 1: Data Management – Introduction and Concepts
- Chapter 2: Introduction to Important AWS Glue Features
- Chapter 3: Data Ingestion
- Section 2 – Data Preparation, Management, and Security
- Chapter 4: Data Preparation
- Chapter 5: Data Layouts
- Chapter 6: Data Management
- Chapter 7: Metadata Management
- Chapter 8: Data Security
- Chapter 9: Data Sharing
- Technical requirements
- Overview of data sharing strategies
- Sharing data with multiple AWS accounts using S3 bucket policies and Glue catalog policies
- Sharing data with multiple AWS accounts using AWS Lake Formation permissions
- Lake Formation permission model
- Lake Formation cross-account sharing
- Lake Formation named resource-based access control
- Lake Formation tag-based access control
- Scenario 2 – sharing data from one account with another using Lake Formation tag-based access control
- Prerequisite – S3
- Prerequisite – Glue
- Prerequisite – Lake Formation and IAM
- Step 1 – configuring Glue catalog policies
- Step 2 – configuring Lake Formation permissions (producer)
- Step 3 – configuring Lake Formation permissions (consumer)
- Summary
- Chapter 10: Data Pipeline Management
- Technical requirements
- What are data pipelines?
- Selecting the appropriate data processing services for your analysis
- Orchestrating your pipelines with workflow tools
- Automating how you provision your pipelines with provisioning tools
- Developing and maintaining your data pipelines
- Summary
- Further reading
- Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
- Chapter 11: Monitoring
- Chapter 12: Tuning, Debugging, and Troubleshooting
- Chapter 13: Data Analysis
- Creating Marketplace connections
- Creating the CloudFormation stack
- The benefit of ad hoc analysis and how a data lake enables it
- Creating and updating Hudi tables using Glue
- Creating and updating Delta Lake tables using Glue
- Inserting data into Lake Formation governed tables
- Consuming streaming data using Glue
- Glue’s integration with OpenSearch
- Cleaning up
- Summary
- Chapter 14: Machine Learning Integration
- Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
- Technical requirements
- Running a highly selective query on a big fact table using AWS Glue
- Dealing with Join performance issues with big fact and small dimension tables in ETL workloads
- Solving Join problems involving big fact and big dimension tables using AWS Glue
- Reducing time on read operations using AWS Glue grouping
- Solving S3 eventual consistency problems using AWS Glue
- Summary
- Why subscribe?
- Other Books You May Enjoy
Product information
- Title: Serverless ETL and Analytics with AWS Glue
- Author(s):
- Release date: August 2022
- Publisher(s): Packt Publishing
- ISBN: 9781800564985