Book description
The missing expert-led manual for the AWS ecosystem: go from foundations to building data engineering pipelines effortlessly.
Purchase of the print or Kindle book includes a free eBook in PDF format.
Key Features
- Learn about common data architectures and modern approaches to generating value from big data
- Explore AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines
- Learn how to architect and implement data lakes and data lakehouses for big data analytics from a data lakes expert
Book Description
Written by a Senior Data Architect with over twenty-five years of experience in the business, Data Engineering with AWS is a book whose sole aim is to make you proficient in using the AWS ecosystem. Using a thorough and hands-on approach to data, this book will give aspiring and new data engineers a solid theoretical and practical foundation to succeed with AWS.
As you progress, you’ll be taken through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. You’ll also learn about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data.
By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.
What you will learn
- Understand data engineering concepts and emerging technologies
- Ingest streaming data with Amazon Kinesis Data Firehose
- Optimize, denormalize, and join datasets with AWS Glue Studio
- Use Amazon S3 events to trigger a Lambda process to transform a file
- Run complex SQL queries on data lake data using Amazon Athena
- Load data into a Redshift data warehouse and run queries
- Create a visualization of your data using Amazon QuickSight
- Extract sentiment data from a dataset using Amazon Comprehend
Who this book is for
This book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts while gaining practical experience with common data engineering services on AWS will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it is not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.
Table of contents
- Data Engineering with AWS
- Contributors
- About the author
- Additional contributors
- About the reviewers
- Preface
- Section 1: AWS Data Engineering Concepts and Trends
- Chapter 1: An Introduction to Data Engineering
- Chapter 2: Data Management Architectures for Analytics
- Technical requirements
- The evolution of data management for analytics
- Understanding data warehouses and data marts – fountains of truth
- Building data lakes to tame the variety and volume of big data
- Bringing together the best of both worlds with the lake house architecture
- Hands-on – configuring the AWS Command Line Interface tool and creating an S3 bucket
- Summary
- Chapter 3: The AWS Data Engineer's Toolkit
- Technical requirements
- AWS services for ingesting data
- Overview of Amazon Database Migration Service (DMS)
- Overview of Amazon Kinesis for streaming data ingestion
- Overview of Amazon MSK for streaming data ingestion
- Overview of Amazon AppFlow for ingesting data from SaaS services
- Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
- Overview of Amazon DataSync for ingesting from on-premises storage
- Overview of the AWS Snow family of devices for large data transfers
- AWS services for transforming data
- AWS services for orchestrating big data pipelines
- AWS services for consuming data
- Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
- Summary
- Chapter 4: Data Cataloging, Security, and Governance
- Technical requirements
- Getting data security and governance right
- Cataloging your data to avoid the data swamp
- The AWS Glue/Lake Formation data catalog
- AWS services for data encryption and security monitoring
- AWS services for managing identity and permissions
- Hands-on – configuring Lake Formation permissions
- Summary
- Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
- Chapter 5: Architecting Data Engineering Pipelines
- Technical requirements
- Approaching the data pipeline architecture
- Identifying data consumers and understanding their requirements
- Identifying data sources and ingesting data
- Identifying data transformations and optimizations
- Loading data into data marts
- Wrapping up the whiteboarding session
- Hands-on – architecting a sample pipeline
- Summary
- Chapter 6: Ingesting Batch and Streaming Data
- Chapter 7: Transforming Data to Optimize for Analytics
- Chapter 8: Identifying and Enabling Data Consumers
- Technical requirements
- Understanding the impact of data democratization
- Meeting the needs of business users with data visualization
- Meeting the needs of data analysts with structured reporting
- Meeting the needs of data scientists and ML models
- Hands-on – creating data transformations with AWS Glue DataBrew
- Summary
- Chapter 9: Loading Data into a Data Mart
- Technical requirements
- Extending analytics with data warehouses/data marts
- What not to do – anti-patterns for a data warehouse
- Redshift architecture review and storage deep dive
- Designing a high-performance data warehouse
- Moving data between a data lake and Redshift
- Hands-on – loading data into an Amazon Redshift cluster and running queries
- Summary
- Chapter 10: Orchestrating the Data Pipeline
- Technical requirements
- Understanding the core concepts for pipeline orchestration
- Examining the options for orchestrating pipelines in AWS
- AWS Data Pipeline for managing ETL between data sources
- AWS Glue Workflows to orchestrate Glue resources
- Apache Airflow as an open source orchestration solution
- Pros and cons of using MWAA
- AWS Step Function for a serverless orchestration solution
- Pros and cons of using AWS Step Function
- Deciding on which data pipeline orchestration tool to use
- Hands-on – orchestrating a data pipeline using AWS Step Function
- Summary
- Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
- Chapter 11: Ad Hoc Queries with Amazon Athena
- Technical requirements
- Amazon Athena – in-place SQL analytics for the data lake
- Tips and tricks to optimize Amazon Athena queries
- Federating the queries of external data sources with Amazon Athena Query Federation
- Managing governance and costs with Amazon Athena Workgroups
- Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
- Hands-on – switching Workgroups and running queries
- Summary
- Chapter 12: Visualizing Data with Amazon QuickSight
- Technical requirements
- Representing data visually for maximum impact
- Understanding Amazon QuickSight's core concepts
- Ingesting and preparing data from a variety of sources
- Creating and sharing visuals with QuickSight analyses and dashboards
- Understanding QuickSight's advanced features – ML Insights and embedded dashboards
- Hands-on – creating a simple QuickSight visualization
- Summary
- Chapter 13: Enabling Artificial Intelligence and Machine Learning
- Chapter 14: Wrapping Up the First Part of Your Learning Journey
- Technical requirements
- Looking at the data analytics big picture
- Examining examples of real-world data pipelines
- Imagining the future – a look at emerging trends
- ACID transactions directly on data lake data
- More data and more streaming ingestion
- Multi-cloud
- Decentralized data engineering teams, data platforms, and a data mesh architecture
- Data and product thinking convergence
- Data and self-serve platform design convergence
- Implementations of the data mesh architecture
- Hands-on – cleaning up your AWS account
- Summary
- Why subscribe?
- Other Books You May Enjoy
Product information
- Title: Data Engineering with AWS
- Author(s):
- Release date: December 2021
- Publisher(s): Packt Publishing
- ISBN: 9781800560413