Mastering Apache Pulsar

Book description

Every enterprise application creates data, including log messages, metrics, user activity, and outgoing messages. Learning how to move these items is almost as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Pulsar, this practical guide shows you how to use this open source event streaming platform to handle real-time data feeds.

Jowanza Joseph, staff software engineer at Finicity, explains how to deploy production Pulsar clusters, write reliable event streaming applications, and build scalable real-time data pipelines with this platform. Through detailed examples, you'll learn Pulsar's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the load manager, and the storage layer.

This book helps you:

  • Understand how event streaming fits in the big data ecosystem
  • Explore Pulsar producers, consumers, and readers for writing and reading events
  • Build scalable data pipelines by connecting Pulsar with external systems
  • Simplify event-streaming application building with Pulsar Functions
  • Manage Pulsar to perform monitoring, tuning, and maintenance tasks
  • Use Pulsar's operational measurements to secure a production cluster
  • Process event streams using Flink and query event streams using Presto

    Table of contents

    1. Preface
      1. Why I Wrote This Book
      2. Who This Book Is For
      3. How I Organized This Book
      4. Conventions Used in This Book
      5. Using Code Examples
      6. O’Reilly Online Learning
      7. How to Contact Us
      8. Acknowledgments
    2. 1. The Value of Real-Time Messaging
      1. Data in Motion
      2. Resource Efficiency
      3. Interesting Applications
        1. Banking
        2. Medical
        3. Security
        4. Internet of Things
      4. Summary
    3. 2. Event Streams and Event Brokers
      1. Publish/Subscribe
      2. Queues
      3. Failure Modes
      4. Push Versus Poll
      5. The Need for Pulsar
        1. Unification
        2. Modularity
        3. Performance
      6. Summary
    4. 3. Pulsar
      1. Origins of Pulsar
      2. Pulsar Design Principles
        1. Multitenancy
        2. Geo-Replication
        3. Performance
        4. Modularity
      3. Pulsar Ecosystem
        1. Pulsar Functions
        2. Pulsar IO
        3. Pulsar SQL
      4. Pulsar Success Stories
        1. Yahoo! JAPAN
        2. Splunk
        3. Iterable
      5. Summary
    5. 4. Pulsar Internals
      1. Brokers
        1. Message Cache
        2. BookKeeper and ZooKeeper Communication
        3. Schema Validation
        4. Inter-Broker Communication
        5. Pulsar Functions and Pulsar IO
      2. Apache BookKeeper
        1. Write-Ahead Logging
        2. Message Storing
        3. Object/Blob Storage
        4. Pravega
        5. Majordodo
      3. Apache ZooKeeper
        1. Naming Service
        2. Configuration Management
        3. Leader Election
        4. Notification System
        5. Apache Kafka
        6. Apache Druid
      4. Pulsar Proxy
      5. Java Virtual Machine (JVM)
        1. Netty
        2. Apache Spark
        3. Apache Lucene
      6. Summary
    6. 5. Consumers
      1. What Does It Mean to Be a Consumer?
      2. Subscriptions
        1. Exclusive
        2. Shared
        3. Key_Shared
        4. Failover
      3. Acknowledgments
        1. Individual Ack
        2. Cumulative Ack
      4. Schemas
        1. Consumer Schema Management
      5. Consumption Modes
        1. Batching
        2. Chunking
      6. Advanced Configuration
        1. Delayed Messages
        2. Retention Policy
        3. Backlog Quota
      7. Configuring a Consumer
        1. Replay
        2. Dead Letter Topics
        3. Retry Letter Topics
      8. Summary
    7. 6. Producers
      1. Synchronous Producers
      2. Asynchronous Producers
      3. Producer Routing
        1. Round-Robin Routing
        2. Single Partition Routing
        3. Custom Partition Routing
      4. Producer Configuration
        1. topicName
        2. producerName
        3. sendTimeoutMs
        4. blockIfQueueFull
        5. maxPendingMessages
        6. maxPendingMessagesAcrossPartitions
        7. messageRoutingMode
        8. hashingScheme
        9. cryptoFailureAction
        10. batchingMaxPublishDelayMicros
        11. batchingMaxMessages
        12. batchingEnabled
        13. compressionType
      5. Schema on Write
        1. Using the Schema Registry
      6. Nonpersistent Topics
        1. Use Cases
        2. Using Nonpersistent Topics
      7. Transactions
      8. Summary
    8. 7. Pulsar IO
      1. Pulsar IO Architecture
        1. Runtime
        2. Performance Considerations
      2. Use Cases
        1. Simple Event Processing Pipelines
        2. Change Data Capture
      3. Considerations
        1. Message Serialization
        2. Pipeline Stability
        3. Failure Handling
      4. Examples
        1. Elasticsearch
        2. Netty
      5. Writing Your Connector
        1. TimescaleDB
      6. Summary
    9. 8. Pulsar Functions
      1. Stream Processing
      2. Pulsar Functions Architecture
        1. Runtime
        2. Isolation
      3. Isolation with Kubernetes Function Deployments
      4. Use Cases
        1. Creating Pulsar Functions
        2. Simple Event Processing
        3. Topic Hygiene
        4. Topic Accounting
      5. Summary
    10. 9. Tiered Storage
      1. Storing Data in the Cloud
        1. Object Storage
      2. Use Cases
        1. Replication
        2. CQRS
        3. Disaster Recovery
      3. Offloading Data
        1. Pulsar Offloaders
      4. Retrieving Offloaded Data
        1. Interacting with Object Store Data
        2. Repopulating Topics
        3. Utilizing Pulsar Client
      5. Summary
    11. 10. Pulsar SQL
      1. Streams as Tables
      2. SQL-on-Anything Engines
        1. Apache Flink: An Alternative Perspective
        2. Presto/Trino
      3. How Pulsar SQL Works
      4. Configuring Pulsar SQL
      5. Performance Considerations
      6. Summary
    12. 11. Deploying Pulsar
      1. Docker
      2. Bare Metal
        1. Minimum Requirements
        2. Getting Started
        3. Deploying ZooKeeper
        4. Starting BookKeeper
        5. Starting Pulsar
      3. Public Cloud Providers
        1. AWS
        2. Azure
        3. Google Cloud Platform
      4. Kubernetes
      5. Summary
    13. 12. Operating Pulsar
      1. Apache BookKeeper Metrics
        1. Server Metrics
        2. Journal Metrics
        3. Storage Metrics
      2. Apache ZooKeeper Metrics
        1. Server Metrics
        2. Request Metrics
      3. Topic Metrics
      4. Consumer Metrics
      5. Pulsar Transaction Metrics
      6. Pulsar Function Metrics
      7. Advanced Operating Techniques
        1. Interceptors and Tracing
        2. Pulsar SQL Metrics
      8. Metrics Forwarding
        1. Dashboards
      9. Summary
    14. 13. The Future
      1. Programming Language Support
        1. Extension Interface
        2. Enhancements to Pulsar Functions
        3. Architectural Simplification/Expansion
        4. Messaging Platform Bridges
      2. Summary
    15. A. Pulsar Admin API
      1. Use Cases
      2. Examples
        1. Creating a Partitioned Topic
        2. Deleting a Partitioned Topic
        3. Creating a Namespace with Specific Policies
        4. Deleting a Namespace
      3. Summary
    16. B. Pulsar Admin CLI
      1. CLI API
      2. Examples
        1. Creating a Partitioned Topic
        2. Creating a Pulsar IO Source
        3. Creating a Pulsar IO Sink
        4. Uploading a Schema
        5. Deleting a Schema
        6. Creating a Namespace
        7. Deleting a Namespace
      3. Summary
    17. C. Geo-Replication
      1. Synchronous Replication
      2. Asynchronous Replication
      3. Replication Patterns
        1. Mesh
        2. Aggregation
        3. Standby
        4. Admin- and Producer-Level Control
      4. Summary
    18. D. Security, Authentication, and Authorization in Pulsar
      1. Encryption in Transit
      2. Encryption at Rest
      3. Authentication
      4. Authorization
      5. Summary
    19. Index
    20. About the Author

    Product information

    • Title: Mastering Apache Pulsar
    • Author(s): Jowanza Joseph
    • Release date: December 2021
    • Publisher(s): O'Reilly Media, Inc.
    • ISBN: 9781492084907