Modern Data Architecture on AWS

Book description

Discover all the essential design and architectural patterns in one place to help you rapidly build and deploy your modern data platform using AWS services.

Key Features

  • Learn to build modern data platforms on AWS using data lakes and purpose-built data services
  • Uncover methods of applying security and governance across your data platform built on AWS
  • Find out how to operationalize and optimize your data platform on AWS

Book Description

Many IT leaders and professionals are adept at extracting data from a particular type of database and deriving value from it. However, designing and implementing a holistic, enterprise-wide data platform with purpose-built data services, all working seamlessly in tandem with minimal manual intervention, still poses a challenge.

This book will help you explore end-to-end solutions to common data, analytics, and AI/ML use cases by leveraging AWS services. The chapters systematically take you through all the building blocks of a modern data platform, including data lakes, data warehouses, data ingestion patterns, data consumption patterns, data governance, and AI/ML patterns. Using real-world use cases, each chapter highlights the features and functionalities of numerous AWS services to enable you to create a scalable, flexible, performant, and cost-effective modern data platform.

By the end of this book, you’ll be equipped with all the necessary architectural patterns and be able to apply this knowledge to efficiently build a modern data platform for your organization using AWS services.

What you will learn

  • Familiarize yourself with the building blocks of modern data architecture on AWS
  • Discover how to create an end-to-end data platform on AWS
  • Design data architectures for your own use cases using AWS services
  • Ingest data from disparate sources into target data stores on AWS
  • Build data pipelines, data sharing mechanisms, and data consumption patterns using AWS services
  • Find out how to implement data governance using AWS services

Who this book is for

This book is for data architects, data engineers, and professionals creating data platforms. Its use case–driven approach helps you conceptualize possible solutions to specific use cases and provides design patterns for building data platforms for any organization. Technical leaders and decision makers will also benefit from understanding their organization's data architecture and how each platform component serves business needs. A basic understanding of data and analytics architectures and systems is desirable, along with a beginner-level understanding of the AWS Cloud.

Table of contents

  1. Modern Data Architecture on AWS
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Conventions used
    5. Get in touch
    6. Share Your Thoughts
    7. Download a free PDF copy of this book
  6. Part 1: Foundational Data Lake
  7. Prologue: The Data and Analytics Journey So Far
    1. Introduction to the data and analytics journey
    2. Traditional data platforms
      1. Three-tier architecture
      2. Enterprise data warehouse (EDW)
    3. Challenges with on-premises data systems
    4. What this book is all about
    5. Summary
  8. Chapter 1: Modern Data Architecture on AWS
    1. Data lakes
      1. Why the need for a data lake?
      2. Challenges with on-premises data lakes
    2. The role of a modern data architecture
      1. Inside-out
      2. Outside-in
      3. Around the perimeter
      4. Sharing across
    3. Modern data architecture on AWS
    4. Pillars of a modern data architecture
      1. Scalable data lakes
      2. Purpose-built analytics services
      3. Unified data access
      4. Unified governance
      5. Performant and cost-effective
    5. Summary
  9. Chapter 2: Scalable Data Lakes
    1. Why choose Amazon S3 as a data lake store?
    2. Business scenario setup
    3. Data lake layers
      1. Raw layer
      2. Standardized layer
      3. Conformed layer
      4. Enriched layer
    4. Data lake patterns
      1. Centralized pattern
      2. Distributed pattern
    5. Data catalogs
      1. Glue Data Catalog
    6. Transactional data lakes
      1. Transactional data lakes using Apache Hudi
      2. Transactional data lakes using Apache Iceberg
      3. Transactional data lakes using Delta Lake
    7. Putting it all together
      1. Raw layer example
      2. Standardized layer example
      3. Conformed layer example
      4. Enriched layer example
    8. Summary
  10. Part 2: Purpose-Built Services And Unified Data Access
  11. Chapter 3: Batch Data Ingestion
    1. Database migration using AWS DMS
      1. AWS DMS overview
      2. AWS DMS usage patterns
    2. SaaS data ingestion using Amazon AppFlow
      1. AppFlow overview
      2. AppFlow usage patterns
    3. Data ingestion using AWS Glue
      1. Glue ETL overview
      2. Glue ETL usage patterns
    4. File and storage migration
      1. AWS DataSync
      2. DataSync usage patterns
      3. AWS Transfer Family
      4. AWS Transfer Family usage patterns
      5. AWS Snow Family
      6. AWS Snow Family usage patterns
    5. Summary
    6. References
  12. Chapter 4: Streaming Data Ingestion
    1. The need for streaming architectures and its challenges
    2. Streaming data ingestion using Amazon Kinesis
      1. Amazon Kinesis Data Streams
      2. KDS overview
      3. Amazon Kinesis Data Firehose
      4. Amazon Kinesis Data Analytics
    3. Streaming data ingestion using Amazon MSK
      1. Amazon MSK overview
    4. Streaming services usage patterns
    5. Summary
    6. References
  13. Chapter 5: Data Processing
    1. Challenges with data processing platforms
      1. Challenge 1 – Fixed costs for an on-premises data processing platform
      2. Challenge 2 – Compute is always on, even when not required
      3. Challenge 3 – Tight coupling between the storage and compute layers
      4. Challenge 4 – Scalability issues
      5. Challenge 5 – Operational issues
      6. Challenge 6 – Limited capabilities
      7. Challenge 7 – Third-party vendor lock-in
    2. Data processing using Amazon EMR
      1. Amazon EMR overview
      2. Use-case scenario 1 – Big data platform migration
      3. Use-case scenario 2 – Collaborative data engineering
    3. Data processing using AWS Glue
      1. Use-case scenario 1 – ETL pipelines using Spark
      2. Use-case scenario 2 – ETL for streaming data
    4. Data processing using AWS Glue DataBrew
      1. Use-case scenario – Low-code/no-code visual data processing
    5. Summary
    6. References
  14. Chapter 6: Interactive Analytics
    1. Analytics using Amazon Athena
      1. Amazon Athena basics
      2. Amazon Athena interactive analytics usage patterns
      3. Amazon Athena with Apache Hudi
      4. Amazon Athena with Apache Iceberg
      5. Amazon Athena with Delta Lake
      6. ETL with Amazon Athena
    2. Analytics using Presto, Trino, and Hive on Amazon EMR
      1. Presto/Trino
      2. Apache Hive
    3. Summary
    4. References
  15. Chapter 7: Data Warehousing
    1. The need for a data warehouse
    2. Data warehousing using Amazon Redshift
      1. Amazon Redshift basics
    3. Data warehouse modernization using Redshift
    4. Data ingestion patterns
      1. Data ingestion using AWS DMS
      2. Data ingestion using auto-copy
      3. Data ingestion using zero-ETL
      4. Data ingestion for real-time streaming data
    5. Data transformation using ELT patterns
      1. Stored procedures
      2. Materialized views (MVs)
    6. Data security and governance patterns
      1. Fine-grained access control
    7. Data consumption patterns
      1. Redshift Spectrum
      2. Redshift Data APIs
      3. SQL reports
      4. Business intelligence (BI) dashboards
    8. Summary
    9. References
  16. Chapter 8: Data Sharing
    1. Internal data sharing
      1. Data sharing using Amazon Athena
      2. Data sharing using Amazon Redshift
    2. External data sharing
      1. Data sharing using AWS Data Exchange
    3. Summary
    4. References
  17. Chapter 9: Data Federation
    1. Data federation using Amazon Athena
      1. Amazon Athena Federated Query overview
      2. Amazon Athena Federated Query use case
    2. Data federation using Amazon Redshift
      1. Amazon Redshift federated queries use case
    3. Summary
    4. References
  18. Chapter 10: Predictive Analytics
    1. Role of AI/ML in predictive analytics
    2. Barriers to AI/ML adoption
    3. AWS AI/ML services overview
    4. AWS AI services, along with use cases
    5. ML using Amazon SageMaker, along with use cases
      1. Amazon SageMaker Canvas
      2. Amazon SageMaker Studio
      3. Amazon SageMaker Data Wrangler
      4. Amazon SageMaker Feature Store
      5. Amazon SageMaker Studio notebooks
    6. ML using Amazon Redshift and Amazon Athena
    7. Summary
    8. References
  19. Chapter 11: Generative AI
    1. How does generative AI help different industries?
      1. Financial services
      2. Healthcare
      3. Life sciences
      4. Manufacturing
      5. Media and entertainment
    2. Fundamentals of generative AI
    3. Generative AI on AWS
      1. Amazon Bedrock
      2. Amazon EC2’s Trn1n, Inf1/Inf2, and P5 instances
      3. Amazon CodeWhisperer
      4. FMs using SageMaker JumpStart
    4. Analytics use case with GenAI
    5. Summary
    6. References
  20. Chapter 12: Operational Analytics
    1. Amazon OpenSearch Service
    2. Amazon OpenSearch Service use cases
      1. Application and security monitoring
      2. Observability
    3. Summary
    4. References
  21. Chapter 13: Business Intelligence
    1. Amazon QuickSight
    2. Amazon QuickSight use-cases
      1. Interactive dashboards
      2. Embedded Insights
      3. BI using ML powered NLQ
    3. Summary
    4. References
  22. Part 3: Govern, Scale, Optimize And Operationalize
  23. Chapter 14: Data Governance
    1. What is data governance?
      1. Why the need for data governance?
    2. Data governance on AWS
    3. Data governance using Amazon DataZone
      1. Amazon DataZone
    4. Fine-grained access control using AWS Lake Formation
      1. AWS Lake Formation
    5. Improving data quality using Glue Data Quality
      1. Glue Data Quality
    6. Sensitive data discovery with Amazon Macie
      1. Amazon Macie
    7. Data collaborations with partners using AWS Clean Rooms
      1. AWS Clean Rooms
    8. Data resolution with AWS Entity Resolution
      1. AWS Entity Resolution
    9. Summary
    10. References
  24. Chapter 15: Data Mesh
    1. Data mesh concepts
      1. What is data mesh?
    2. Data mesh on AWS
    3. Data mesh on an Amazon S3-based data lake
    4. Data mesh on Amazon Redshift
    5. Summary
    6. References
  25. Chapter 16: Performant and Cost-Effective Data Platform
    1. Why does a performant and cost-effective data platform matter?
    2. Data storage optimizations
      1. Amazon S3 optimizations
    3. Compute resource optimizations
      1. Compute instance families
      2. Other services
      3. Capacity reservations and discounted compute instances
    4. Cost optimization tools
      1. AWS Cost Explorer
      2. AWS Budgets
      3. AWS Cost Anomaly Detection
      4. AWS Cost and Usage Reports
      5. AWS Trusted Advisor
      6. AWS Savings Plans
      7. AWS Compute Optimizer
    5. Tool-specific performance tuning
      1. Performance tuning measures on Amazon Redshift
    6. Summary
    7. References
  26. Chapter 17: Automate, Operationalize, and Monetize
    1. The need for automation
    2. The DevOps process
      1. Key DevOps components
    3. The DataOps process
      1. Amazon MWAA
      2. AWS Step Functions
    4. The MLOps process
    5. Data monetization
      1. DaaS
      2. Insights-as-a-service
      3. API-as-a-service
    6. Wrap-up
    7. Summary
    8. References
  27. Index
    1. Why subscribe?
  28. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Modern Data Architecture on AWS
  • Author(s): Behram Irani
  • Release date: August 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781801813396