Serverless ETL and Analytics with AWS Glue

Book description

Build efficient data lakes that can scale to virtually unlimited size using AWS Glue

    Organizations have increasingly gravitated toward services such as AWS Glue, which take on undifferentiated heavy lifting and provide serverless Spark so that you can create and manage data lakes in a serverless fashion. This guide shows you how to use AWS Glue to solve real-world problems while you learn about data processing, data integration, and building data lakes.

    Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, and get to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options.

    By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.

    What you will learn

    • Apply various AWS Glue features to manage and create data lakes
    • Use Glue DataBrew and Glue Studio for data preparation
    • Optimize data layout in cloud storage to accelerate analytics workloads
    • Manage metadata including database, table, and schema definitions
    • Secure your data using access control, encryption, auditing, and networking
    • Monitor AWS Glue jobs to detect delays and loss of data
    • Integrate Spark ML and SageMaker with AWS Glue to create machine learning models

    Who this book is for

    ETL developers, data engineers, and data analysts

Table of contents

  1. Serverless ETL and Analytics with AWS Glue
  2. Contributors
  3. About the authors
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  6. Section 1 – Introduction, Concepts, and the Basics of AWS Glue
  7. Chapter 1: Data Management – Introduction and Concepts
    1. Types of data processing – OLTP and OLAP
    2. Data warehouses and data marts
    3. Data lakes
    4. Data lakehouse
    5. Data mesh
    6. Distributed computing for big data
      1. Apache Spark
      2. Apache Spark on the AWS cloud
    7. AWS Glue
      1. Querying data using AWS
    8. Summary
  8. Chapter 2: Introduction to Important AWS Glue Features
    1. Data integration
    2. Integrating data with AWS Glue
      1. Data discovery
      2. Data ingestion
      3. Data preparation
      4. Data replication
    3. Features of AWS Glue
      1. AWS Glue Data Catalog
      2. Glue connections
      3. AWS Glue crawlers
      4. Custom classifiers
      5. AWS Glue Schema Registry
      6. AWS Glue ETL jobs
      7. Glue development endpoints
      8. AWS Glue interactive sessions
      9. Triggers
    4. Summary
  9. Chapter 3: Data Ingestion
    1. Technical requirements
    2. Data ingestion from file/object stores
      1. Data ingestion from Amazon S3
      2. Data ingestion from HDFS data stores
    3. Data ingestion from JDBC data stores
      1. AWS Glue custom JDBC connectors
    4. Data ingestion from streaming data sources
      1. AWS Glue Schema Registry
    5. Data ingestion from SaaS data stores
    6. Summary
  10. Section 2 – Data Preparation, Management, and Security
  11. Chapter 4: Data Preparation
    1. Technical requirements
    2. Introduction to data preparation
    3. Data preparation using AWS Glue
      1. Visual data preparation using AWS Glue DataBrew
      2. Source code-based approach to data preparation using AWS Glue
    4. Selecting the right service/tool
    5. Summary
  12. Chapter 5: Data Layouts
    1. Technical requirements
    2. Why do we need to pay attention to data layout?
    3. Key techniques to optimally storing data
      1. Selecting a file format
      2. Compressing your data
      3. Splittable or unsplittable files
      4. Partitioning
      5. Bucketing
    4. Optimizing the number of files and each file size
      1. What is compaction?
      2. Compaction with AWS Glue ETL Spark jobs
      3. Automatic Compaction with AWS Lake Formation acceleration
    5. Optimizing your storage with Amazon S3
      1. Selecting suitable S3 storage classes for your data
      2. Using S3 Lifecycle for managing object lifecycles
    6. Summary
    7. Further reading
  13. Chapter 6: Data Management
    1. Technical requirements
    2. Normalizing data
      1. Casting data types and map column names
      2. Inferring schemas
      3. Computing schemas on the fly
      4. Enforcing schemas
      5. Flattening nested schemas
      6. Normalizing scale
      7. Handling missing values and outliers
      8. Normalizing date and time values
      9. Handling error records
    3. Deduplicating records
    4. Denormalizing tables
    5. Securing data content
      1. Masking values
      2. Hashing values
    6. Managing data quality
      1. AWS Glue DataBrew data quality rules
      2. DeeQu
    7. Summary
  14. Chapter 7: Metadata Management
    1. Technical requirements
    2. Populating metadata
      1. Glue Data Catalog API
      2. DDL statements
      3. Glue crawlers
      4. Crawler configuration
    3. Maintaining metadata
      1. Glue crawlers
      2. Updating Data Catalog tables from ETL jobs
    4. Partition management
      1. Partition indexes
    5. Versioning and rollback
      1. Table versioning
      2. Lake Formation-governed tables
    6. Lineage
      1. Glue DataBrew
    7. Summary
  15. Chapter 8: Data Security
    1. Technical requirements
    2. Access control
      1. IAM permissions
      2. Glue dependencies on other AWS services
      3. S3 bucket policies
      4. S3 object ownership
      5. Lake Formation permissions
    3. Encryption
      1. Encryption at rest
      2. Encryption in transit
    4. Network
      1. Glue network architecture
      2. Glue connections
      3. Network configuration requirements and limitations
      4. Connecting to resources on the public internet
      5. Connecting to resources in your on-premise network
    5. Summary
  16. Chapter 9: Data Sharing
    1. Technical requirements
    2. Overview of data sharing strategies
      1. Single tenant
      2. Hub and spoke
      3. Data mesh
    3. Sharing data with multiple AWS accounts using S3 bucket policies and Glue catalog policies
      1. Scenario 1 – sharing data from one account with another using S3 bucket policies and Glue catalog policies
      2. Prerequisite – S3
      3. Prerequisite – Glue
      4. Configuring S3 bucket policies and Glue Catalog resource policies
    4. Sharing data with multiple AWS accounts using AWS Lake Formation permissions
      1. Lake Formation permission model
      2. Lake Formation cross-account sharing
      3. Lake Formation named resource-based access control
      4. Lake Formation tag-based access control
      5. Scenario 2 – sharing data from one account with another using Lake Formation Tag-based access control
      6. Prerequisite – S3
      7. Prerequisite – Glue
      8. Prerequisite – Lake Formation and IAM
      9. Step 1 – configuring Glue catalog policies
      10. Step 2 – configuring Lake Formation permissions (producer)
      11. Step 3 – configuring Lake Formation permissions (consumer)
    5. Summary
  17. Chapter 10: Data Pipeline Management
    1. Technical requirements
    2. What are data pipelines?
      1. Why do we need data pipelines?
      2. How do we build and manage data pipelines?
    3. Selecting the appropriate data processing services for your analysis
      1. AWS Batch
      2. Amazon ECS
      3. AWS Lambda
      4. AWS Glue ETL jobs
      5. Amazon EMR
    4. Orchestrating your pipelines with workflow tools
      1. Using AWS Glue workflows
      2. Using AWS Step Functions
      3. Using Amazon Managed Workflows for Apache Airflow
    5. Automating how you provision your pipelines with provisioning tools
      1. Provisioning resources with AWS CloudFormation
      2. Provisioning AWS Glue workflows and resources with AWS Glue Blueprints
    6. Developing and maintaining your data pipelines
      1. Developing AWS Glue ETL jobs locally
      2. Deploying AWS Glue ETL jobs
      3. Deploying workflows and pipelines using provisioning tools such as IaC
    7. Summary
    8. Further reading
  18. Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
  19. Chapter 11: Monitoring
    1. Defining an SLA for a data platform
    2. Monitoring the SLA of a data platform
    3. Monitoring the components of a data platform
      1. Monitoring state changes
      2. Monitoring delay
      3. Monitoring performance
      4. Monitoring common failures
      5. Monitoring log messages
    4. Analyzing usage
    5. Summary
  20. Chapter 12: Tuning, Debugging, and Troubleshooting
    1. Tuning AWS Glue workloads
      1. Tuning AWS Glue crawlers
      2. Tuning the performance of AWS Glue Spark ETL jobs
    2. Troubleshooting and debugging common issues in AWS Glue ETL
      1. ETL job failures
    3. Summary
  21. Chapter 13: Data Analysis
    1. Creating Marketplace connections
      1. Creating the Glue Hudi connection
      2. Creating a Delta Lake connection
      3. Creating an OpenSearch connection
    2. Creating the CloudFormation stack
      1. Prerequisites for creating the CloudFormation stack
    3. The benefit of ad hoc analysis and how a data lake enables it
      1. Amazon Athena
      2. Amazon Redshift Spectrum
    4. Creating and updating Hudi tables using Glue
    5. Creating and updating Delta Lake tables using Glue
    6. Inserting data into Lake Formation governed tables
    7. Consuming streaming data using Glue
      1. Creating chapter-data-analysis-msk-connection
      2. Loading and consuming data from MSK using Glue
      3. Glue streaming job as a consumer of a Kafka topic
      4. Hudi DeltaStreamer streaming job as a consumer of a Kafka topic
      5. Creating and consuming CDC data through streaming jobs on Glue
    8. Glue’s integration with OpenSearch
    9. Cleaning up
    10. Summary
  22. Chapter 14: Machine Learning Integration
    1. Technical requirements
    2. Glue ML transformations
      1. Creating an ML transform
      2. Training an ML transform
      3. Using an ML transform
    3. SageMaker integration
    4. Developing ML pipelines with Glue
    5. Summary
  23. Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
    1. Technical requirements
    2. Running a highly selective query on a big fact table using AWS Glue
      1. Hands-on tutorial
    3. Dealing with Join performance issues with big fact and small dimension tables in ETL workloads
    4. Solving Join problems involving big fact and big dimension tables using AWS Glue
      1. Hands-on tutorial
      2. Solution
    5. Reducing time on read operations using AWS Glue grouping
    6. Solving S3 eventual consistency problems using AWS Glue
      1. Using glueparquet
      2. S3-optimized output committer
    7. Summary
    8. Why subscribe?
  24. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Serverless ETL and Analytics with AWS Glue
  • Author(s): Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur
  • Release date: August 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781800564985