Streaming Databases

Book description

Real-time applications are becoming the norm today. But building a model that works properly requires real-time data from the source, in-flight stream processing, and low-latency serving of its analytics. With this practical book, data engineers, data architects, and data analysts will learn how to use streaming databases to build real-time solutions.

Authors Hubert Dulay and Ralph M. Debusmann take you through streaming database fundamentals, including how these databases reduce infrastructure for real-time solutions. You'll learn the difference between streaming databases, stream processing, and real-time online analytical processing (OLAP) databases. And you'll discover when to use push queries versus pull queries, and how to serve synchronous and asynchronous data emanating from streaming databases.

This guide helps you:

  • Explore stream processing and streaming databases
  • Learn how to build a real-time solution with a streaming database
  • Understand how to construct materialized views from any number of streams
  • Learn how to serve synchronous and asynchronous data
  • Get started building low-complexity streaming solutions with minimal setup

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Hubert’s Acknowledgements
    6. Ralph’s Acknowledgments
  3. 1. Streaming Foundations
    1. Turning the Database Inside Out
    2. Externalizing Database Features
      1. Write-Ahead Log
      2. Streaming Platforms
      3. Materialized Views
    3. Use Case: Clickstream Analysis
      1. Understanding Transactions and Events
      2. Domain-Driven Design
    4. Context Enrichment
    5. Change Data Capture
    6. Connectors
      1. Connector Middleware
      2. Embedded
      3. Custom-Built
    7. Summary
  4. 2. Stream Processing Platforms
    1. Stateful Transformations
    2. Data Pipelines
      1. ELT Limitations
      2. Stream Processing with ELT
    3. Stream Processors
      1. Popular Stream Processors
      2. Newer Stream Processors
    4. Emulating Materialized Views in Apache Spark
    5. Two Types of Streams
      1. Append Stream
      2. Debezium Change Data
      3. Materialized Views
    6. Summary
  5. 3. Serving Real-Time Data
    1. Real-Time Expectations
    2. Choosing an Analytical Data Store
    3. Sourcing from a Topic
    4. Ingestion Transformations
    5. OLTP Versus OLAP
      1. ACID
      2. Row- Versus Column-Based Optimization
    6. Queries Per Second and Concurrency
    7. Indexing
    8. Serving Analytical Results
      1. Synchronous Queries
      2. Asynchronous Queries
      3. Push Versus Pull Queries
    9. Summary
  6. 4. Materialized Views
    1. Views, Materialized Views, and Incremental Updates
    2. Change Data Capture
    3. Push Versus Pull Queries
    4. CDC and Upsert
    5. Joining Streams
      1. Apache Calcite
      2. Clickstream Use Case
    6. Summary
  7. 5. Introduction to Streaming Databases
    1. Identifying the Streaming Database
      1. Column-Based Streaming Database
      2. Row-Based Streaming Database
      3. Edge Streaming-Like Databases
    2. SQL Expressivity
    3. Streaming Debuggability
      1. Advantages of Debugging in Streaming Databases
      2. SQL Is Not a Silver Bullet
    4. Streaming Database Implementations
    5. Streaming Database Architecture
    6. ELT with Streaming Databases
    7. Summary
  8. 6. Consistency
    1. A Toy Example
      1. Transactions
      2. Analyzing the Transactions
    2. Comparing Consistency Across Stream Processing Systems
      1. Flink SQL
      2. ksqlDB
      3. Proton (Timeplus)
      4. RisingWave
      5. Materialize
      6. Pathway
      7. Out-of-Order Messages
    3. Going Beyond Eventual Consistency
      1. Why Do Eventually Consistent Stream Processors Fail the Toy Example?
      2. How Do Internally Consistent Stream Processing Systems Pass the Toy Example?
      3. How Can We Fix Eventually Consistent Stream Processing Systems to Pass the Toy Example?
    4. Consistency Versus Latency
    5. Summary
  9. 7. Emergence of Other Hybrid Data Systems
    1. Data Planes
    2. Hybrid Transactional/Analytical Database
    3. Other Hybrid Databases
    4. Motivations for Hybrid Systems
    5. The Influence of PostgreSQL on Hybrid Databases
    6. Near-Edge Analytics
    7. Next-Generation Hybrid Databases
      1. Next-Generation Streaming OLTP Databases
      2. Next-Generation Streaming RTOLAP Databases
      3. Next-Generation HTAP Databases
    8. Summary
  10. 8. Zero-ETL or Near-Zero-ETL
    1. ETL Model
    2. Zero-ETL
    3. Near-Zero-ETL
      1. PeerDB
      2. Proton
      3. Embedded OLAP
      4. Data Gravity and Replication
      5. Analytical Data Reduction
    4. Lambda Architecture
      1. Apache Pinot Hybrid Tables
      2. Pipeline Configurations
    5. Summary
  11. 9. The Streaming Plane
    1. Data Gravity
    2. Components of the Streaming Plane
    3. Streaming Plane Infrastructure
    4. Operational Analytics
    5. Data Mesh
      1. Pillars of a Data Mesh
      2. Challenge of a Data Mesh
    6. Streaming Data Mesh with Streaming Plane and Streaming Databases
      1. Data Locality
      2. Data Replication
    7. Summary
  12. 10. Deployment Models
    1. Consistent Streaming Database
    2. Consistent Streaming Processor and RTOLAP
    3. Eventually Consistent OLAP Streaming Database
    4. Eventually Consistent Stream Processor and RTOLAP
    5. Eventually Consistent Stream Processor and HTAP
    6. ksqlDB
    7. Incremental View Maintenance
    8. Postgres Multicorn Foreign Data Wrapper
    9. When to Use Code-Based Stream Processors
    10. When to Use Lakehouse/Streamhouse Technologies
    11. Caching Technologies
    12. Where to Do Processing and Querying in General?
      1. The Four “Where” Questions
      2. An Analytical Use Case
      3. Consequences
    13. Summary
  13. 11. Future State of Real-Time Data
    1. The Convergence of the Data Planes
    2. Graph Databases
      1. Memgraph
      2. thatDot/Quine
    3. Vector Databases
      1. Milvus 2.x: Streaming as the Central Backbone
      2. RTOLAP Databases: Adding Vector Search
    4. Incremental View Maintenance
      1. pg_ivm
      2. Hydra
      3. Epsio
      4. Feldera
      5. PeerDB
    5. Data Wrapping and Postgres Multicorn
    6. Classical Databases
    7. Data Warehouses
      1. BigQuery
      2. Redshift
      3. Snowflake
    8. Lakehouse
      1. Delta Lake
      2. Apache Paimon
      3. Apache Iceberg
      4. Apache Hudi
      5. OneTable or XTable
      6. The Relationship of Streaming and Lakehouses
    9. Conclusion
  14. Index
  15. About the Authors

Product information

  • Title: Streaming Databases
  • Author(s): Hubert Dulay, Ralph Matthias Debusmann
  • Release date: August 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098154837