Book description
Looking to revolutionize your data transformation game with AWS? Look no further! From strong foundations to hands-on building of data engineering pipelines, our expert-led manual has got you covered.
Key Features
- Delve into robust AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines
- Stay up to date with a comprehensive revised chapter on Data Governance
- Build modern data platforms with a new section covering transactional data lakes and data mesh
Book Description
This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability.
You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You’ll learn how to ensure strong data governance, and about populating data marts and data warehouses along with how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS.
By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!
What you will learn
- Seamlessly ingest streaming data with Amazon Kinesis Data Firehose
- Optimize, denormalize, and join datasets with AWS Glue Studio
- Use Amazon S3 events to trigger a Lambda process to transform a file
- Load data into a Redshift data warehouse and run queries with ease
- Visualize and explore data using Amazon QuickSight
- Extract sentiment data from a dataset using Amazon Comprehend
- Build transactional data lakes using Apache Iceberg with Amazon Athena
- Learn how a data mesh approach can be implemented on AWS
Who this book is for
This book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts, while gaining practical experience with common data engineering services on AWS, will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it’s not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.
Table of contents
- Preface
- Section 1: AWS Data Engineering Concepts and Trends
- An Introduction to Data Engineering
- Data Management Architectures for Analytics
- Technical requirements
- The evolution of data management for analytics
- A deeper dive into data warehouse concepts and architecture
- An overview of data lake architecture and concepts
- Bringing together the best of data warehouses and data lakes
- Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets
- Summary
- The AWS Data Engineer’s Toolkit
- Technical requirements
- An overview of AWS services for ingesting data
- Amazon Database Migration Service (DMS)
- Amazon Kinesis for streaming data ingestion
- Amazon MSK for streaming data ingestion
- Amazon AppFlow for ingesting data from SaaS services
- AWS Transfer Family for ingestion using FTP/SFTP protocols
- AWS DataSync for ingesting from on premises and multicloud storage services
- The AWS Snow family of devices for large data transfers
- AWS Glue for data ingestion
- An overview of AWS services for transforming data
- An overview of AWS services for orchestrating big data pipelines
- An overview of AWS services for consuming data
- Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
- Summary
- Data Governance, Security, and Cataloging
- Technical requirements
- The many different aspects of data governance
- Data security, access, and privacy
- Data quality, data profiling, and data lineage
- Business and technical data catalogs
- AWS services that help with data governance
- The AWS Glue/Lake Formation technical data catalog
- AWS Glue DataBrew for profiling datasets
- AWS Glue Data Quality
- AWS Key Management Service (KMS) for data encryption
- Amazon Macie for detecting PII data in Amazon S3 objects
- The AWS Glue Studio Detect PII transform for detecting PII data in datasets
- Amazon GuardDuty for detecting threats in an AWS account
- AWS Identity and Access Management (IAM) service
- Using AWS Lake Formation to manage data lake access
- Hands-on – configuring Lake Formation permissions
- Summary
- Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations
- Architecting Data Engineering Pipelines
- Technical requirements
- Approaching the data pipeline architecture
- Identifying data consumers and understanding their requirements
- Identifying data sources and ingesting data
- Identifying data transformations and optimizations
- Loading data into data marts
- Wrapping up the whiteboarding session
- Hands-on – architecting a sample pipeline
- Summary
- Ingesting Batch and Streaming Data
- Technical requirements
- Understanding data sources
- Ingesting data from a relational database
- Ingesting streaming data
- Hands-on – ingesting data with AWS DMS
- Hands-on – ingesting streaming data
- Summary
- Transforming Data to Optimize for Analytics
- Identifying and Enabling Data Consumers
- Technical requirements
- Understanding the impact of data democratization
- Meeting the needs of business users with data visualization
- Meeting the needs of data analysts with structured reporting
- Meeting the needs of data scientists and ML models
- Hands-on – creating data transformations with AWS Glue DataBrew
- Summary
- A Deeper Dive into Data Marts and Amazon Redshift
- Technical requirements
- Extending analytics with data warehouses/data marts
- What not to do – anti-patterns for a data warehouse
- Redshift architecture review and storage deep dive
- Designing a high-performance data warehouse
- Moving data between a data lake and Redshift
- Exploring advanced Redshift features
- Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries
- Summary
- Orchestrating the Data Pipeline
- Technical requirements
- Understanding the core concepts for pipeline orchestration
- Examining the options for orchestrating pipelines in AWS
- Hands-on – orchestrating a data pipeline using AWS Step Functions
- Summary
- Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
- Ad Hoc Queries with Amazon Athena
- Technical requirements
- An introduction to Amazon Athena
- Tips and tricks to optimize Amazon Athena queries
- Exploring advanced Athena functionality
- Managing groups of users with Amazon Athena workgroups
- Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
- Hands-on – switching workgroups and running queries
- Summary
- Visualizing Data with Amazon QuickSight
- Technical requirements
- Representing data visually for maximum impact
- Understanding Amazon QuickSight’s core concepts
- Ingesting and preparing data from a variety of sources
- Creating and sharing visuals with QuickSight analyses and dashboards
- Understanding QuickSight’s advanced features
- Hands-on – creating a simple QuickSight visualization
- Summary
- Enabling Artificial Intelligence and Machine Learning
- Technical requirements
- Understanding the value of AI and ML for organizations
- Exploring AWS services for ML
- Exploring AWS services for AI
- Building generative AI solutions on AWS
- Common use cases for LLMs
- Hands-on – reviewing reviews with Amazon Comprehend
- Summary
- Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World
- Building Transactional Data Lakes
- Technical requirements
- What does it mean for a data lake to be transactional?
- An overview of Delta Lake, Apache Hudi, and Apache Iceberg
- AWS service integrations for building transactional data lakes
- Hands-on – Working with Apache Iceberg tables in AWS
- Summary
- Implementing a Data Mesh Strategy
- Technical requirements
- What is a data mesh?
- Challenges that a data mesh approach attempts to resolve
- The organizational and technical challenges of building a data mesh
- AWS services that help enable a data mesh approach
- A sample architecture for a data mesh on AWS
- Hands-on – Setting up Amazon DataZone
- Summary
- Building a Modern Data Platform on AWS
- Wrapping Up the First Part of Your Learning Journey
- Technical requirements
- Understanding the complexities of real-world data environments
- Examining examples of real-world data pipelines
- Imagining the future – a look at emerging trends
- Increased adoption of a data mesh approach
- Requirement to work in a multi-cloud environment
- Migration to open table formats
- Managing costs with FinOps
- The merging of data warehouses and data lakes
- The application of generative AI to business intelligence and analytics
- The application of generative AI to building transformations
- Hands-on – cleaning up your AWS account
- Summary
- Other Books You May Enjoy
- Index
Product information
- Title: Data Engineering with AWS - Second Edition
- Author(s):
- Release date: October 2023
- Publisher(s): Packt Publishing
- ISBN: 9781804614426