Practical Lakehouse Architecture

Book description

This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse, and it shows how a lakehouse can affect your data platform: managing structured and unstructured data, supporting BI and AI/ML use cases, and enabling more rigorous data governance and security measures.

Practical Lakehouse Architecture shows you how to:

  • Understand key lakehouse concepts and features like transaction support, time travel, and schema evolution
  • Understand the differences between traditional and lakehouse data architectures
  • Differentiate between various file formats and table formats
  • Design lakehouse architecture layers for storage, compute, metadata management, and data consumption
  • Implement data governance and data security within the platform
  • Evaluate technologies and decide on the best technology stack to implement the lakehouse for your use case
  • Make critical design decisions and address practical challenges to build a future-ready data platform
  • Start your lakehouse implementation journey and migrate data from existing systems to the lakehouse

Table of contents

  1. Preface
    1. Who Should Read This Book?
    2. Why I Wrote This Book
    3. Navigating This Book
    4. O’Reilly Online Learning
    5. Conventions Used in This Book
    6. How to Contact Us
    7. Acknowledgments
  2. 1. Introduction to Lakehouse Architecture
    1. Understanding Data Architecture
      1. What Is Data Architecture?
      2. How Does Data Architecture Help Build a Data Platform?
      3. Core Components of a Data Platform
    2. Why Do We Need a New Data Architecture?
    3. Lakehouse Architecture: A New Pattern
      1. The Lakehouse: Best of Both Worlds
      2. Understanding Lakehouse Architecture
      3. Lakehouse Architecture Characteristics
      4. Lakehouse Architecture Benefits
    4. Key Takeaways
    5. References
  3. 2. Traditional Architectures and Modern Data Platforms
    1. Traditional Architectures: Data Lakes and Data Warehouses
      1. Data Warehouse Fundamentals
      2. Data Lake Fundamentals
    2. Modern Data Platforms
      1. Finding Answers in the Cloud
      2. Standalone Approach
      3. Combined Approach
      4. Expectations of Modern Data Platforms
    3. Comparison: Data Warehouse, Data Lake, Lakehouse
      1. Capabilities and Limitations
      2. Implementation Activities
      3. Administration and Management
      4. Business Outcomes
    4. Lakehouse Architecture: The Default Choice for Future Data Platforms?
    5. Key Takeaways
    6. References
  4. 3. Storage: The Heart of the Lakehouse
    1. Lakehouse Storage: Key Concepts
      1. Row Versus Columnar Storage
      2. Storage-based Performance Optimization
    2. Lakehouse Storage Components
      1. Cloud Object Storage
      2. File Formats
      3. Table Formats
    3. Key Design Considerations
      1. Ecosystem Support
      2. Community Support
      3. Supported File Formats
      4. Supported Compute Engines
      5. Supported Features
      6. Commercial Product Support
      7. Current and Future Versions
      8. Performance Benchmarking
      9. Comparisons
      10. Sharing Features
    4. Key Takeaways
    5. References
  5. 4. Data Catalogs
    1. Understanding Metadata
      1. Technical Metadata
      2. Business Metadata
    2. How Metastores and Data Catalogs Work Together
    3. Features of a Data Catalog
      1. Search, Explore, and Discover Data
      2. Data Classification
      3. Data Governance and Security
      4. Data Lineage
    4. Unified Data Catalog
      1. Challenges of Siloed Metadata Management
      2. What Is a Unified Data Catalog?
      3. Benefits of a Unified Data Catalog
    5. Implementing a Data Catalog: Key Design Considerations and Options
      1. Using Hive metastore
      2. Using AWS Services
      3. Using Azure Services
      4. Using GCP Services
      5. Using Databricks
    6. Key Takeaways
    7. References
  6. 5. Compute Engines for Lakehouse Architectures
    1. Data Computation Benefits of Lakehouse Architecture
      1. Independent Scaling
      2. Cross-region, Cross-account Access
      3. Unified Batch and Real-Time Processing
      4. Enhanced BI Performance
      5. Freedom to Choose Different Engine Types
      6. Cross-zone Analysis
    2. Compute Engine Options for Lakehouse Platforms
      1. Open Source Tools
      2. Cloud Services
      3. Third-Party Platforms
    3. Key Design Considerations
      1. Open Table Format Support
      2. Supported Version and Features
      3. Ecosystem Support
      4. Persona-Based Preferences
      5. Managed Open Source Versus Cloud Native Versus Third-Party Products
      6. Data Consumption Workloads
    4. Key Takeaways
    5. References
  7. 6. Data (and AI) Governance and Security in Lakehouse Architecture
    1. What Is Data Governance and Data Security?
    2. Benefits of Data Governance and Data Security
    3. Unified Governance and Security in Lakehouse Architecture
    4. Governance and Security Processes in Lakehouse Architecture
      1. Metadata Management
      2. Compliance and Regulations
      3. Data and ML Model Quality
      4. Lineage Across Data and AI assets
      5. Data and AI Asset Sharing
      6. Data Ownership
      7. Auditing and Monitoring
      8. Access Management
      9. Data Protection
      10. Handling Sensitive Data
    5. What’s Your Role?
    6. Key Takeaways
    7. References
  8. 7. The Big Picture: Designing and Implementing a Lakehouse Platform
    1. Pre-design Activities
      1. Understanding Platform Requirements
      2. Studying Existing System
      3. Understanding the Organization’s Vision and Data Strategy
      4. Conducting Workshops and Interviews
    2. Choosing the Right Architecture
    3. Establishing Guiding Principles
      1. Data Ecosystem
      2. Scalability and Performance
      3. Cost Control and Optimization
      4. Platform Operations
      5. Governance and Security
    4. Design Considerations and Implementation Best Practices
      1. Architecture Blueprint
      2. Data Ingestion
      3. Data Storage
      4. Data Processing
      5. Data Consumption and Delivery
      6. Common Services
    5. Design References
      1. Step-by-Step Design Guide
      2. Design Questionnaire
    6. Key Takeaways
    7. References
  9. 8. Lakehouse in the Real World
    1. Delivering a Real-World Lakehouse
    2. Estimation and Planning Phase
      1. Estimation
      2. Planning
    3. Analysis and Design Phase
      1. Analyzing the Existing System
      2. Data Modeling
      3. Finalizing the Tech Stack
    4. Implementation and Test Phase
      1. Historical Data Migration
      2. Data Reconciliation and Testing
      3. Reverse Engineering
      4. Data Quality and Handling Sensitive Data
    5. Support and Maintenance Phase
      1. Auditing and Tracking
      2. Disaster Recovery Strategy
      3. Decommissioning the Old System
    6. Delivery References
      1. Project Deliverables
      2. Reference Architectures
    7. Key Takeaways
    8. References
  10. 9. Lakehouse of the Future
    1. Warehouse to Lakehouse: What’s Next?
      1. Data Mesh
      2. HTAP
      3. Zero ETL
    2. Interoperability and New Formats
      1. Universal Format (UniForm)
      2. Apache XTable
      3. Upcoming File and Table Formats
    3. Managed Platforms for Public and Private Clouds
      1. Microsoft Fabric and Other Platforms
      2. Managed Lakehouse for Private Cloud Platform
    4. AI in a Lakehouse
    5. Key Takeaways
    6. Book Conclusion
    7. References
  11. Index
  12. About the Author

Product information

  • Title: Practical Lakehouse Architecture
  • Author(s): Gaurav Ashok Thalpati
  • Release date: July 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098153014