Getting Started with Kafka
Published by Pearson
Building Effective Data Pipelines
- Get hands-on experience with Kafka in just four hours
- Use Python with Kafka to create end-to-end data flows
- Inspect Kafka data flow examples with a GUI
- Take away a complete copy of the instructor's notes, example code, virtual machine, and class slides to refer to after class
The generation and movement of big data are never constant. In many cases, organizational data flows start with a simple and direct end-to-end connection. While this basic connection model seems manageable, adding more data sources and destinations can easily create an unmaintainable morass of applications and data flows.
Apache Kafka is designed to manage data flow by decoupling the data source from the destination. Placed in the middle of organizational data flows, Kafka provides a robust data buffer, or broker, that helps create and manage data pipelines.
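For example, a producer application can publish records to a Kafka topic while a separate consumer application reads them, with neither side aware of the other. Below is a minimal sketch of that pattern, assuming the kafka-python package and a broker at localhost:9092 (both assumptions; adjust for your own environment).

    # Minimal sketch: decoupling a data source from its destination with Kafka.
    # Assumes the kafka-python package (pip install kafka-python) and a broker
    # reachable at localhost:9092.
    from kafka import KafkaProducer, KafkaConsumer

    # The producer only knows about the broker and the topic, not the consumer.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-data", b"temperature=21.5")
    producer.flush()

    # The consumer reads from the same topic, independently of the producer.
    consumer = KafkaConsumer(
        "sensor-data",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
    )
    for message in consumer:
        print(message.value)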
In this training, the basic Kafka data broker design and operation are explained and illustrated using both the command line and a GUI. More advanced examples that include streaming weather and image data for analysis and storage are demonstrated using a downloadable virtual machine.
What you’ll learn and how you can apply it
By the end of the live online course, you’ll understand:
- The design and components of the Apache Kafka data broker
- How Kafka manages data flows using brokers
- How to configure Kafka, create topics, and use Kafka as a data broker (see the short sketch after these lists)
- How to write and use Kafka consumers and producers in Python
- How to use Python and Kafka to stream open weather data
- How to use Python and Kafka to stream, store, and analyze images in real time
And you’ll be able to:
- Understand the benefits of Kafka and how to use it
- Create basic Kafka producers and consumers
- Write Python applications to work directly with Kafka
- Inspect Kafka data flows in real time with a GUI
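As a small taste of the topic configuration covered in class, here is a hypothetical sketch that creates a topic with kafka-python's admin client; the topic name, partition count, and broker address are placeholders for illustration only.

    # Hypothetical sketch: creating a Kafka topic programmatically.
    # Assumes kafka-python and a broker at localhost:9092; the topic name,
    # partition count, and replication factor are illustrative defaults.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(name="weather", num_partitions=1, replication_factor=1)
    ])
    admin.close()

The same result can also be reached with Kafka's command-line tools, which the course demonstrates alongside the GUI.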
This live event is for you because...
- You want to understand and visualize Apache Kafka and data streaming
- You want to learn the basics of building data pipelines with Kafka
- Hands-on experience is important to you when learning a new technology
- You want a working development environment for use after the training
Prerequisites
- The hands-on portion of the course is done using the Linux command line. The course assumes familiarity with the command line on a modern Linux server.
- Please be aware that if you have no experience with the Linux command line, you may find this course difficult to follow at times. See Recommended Preparation if you need a refresher.
Course Set-up
To run the class examples, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that can run on your laptop/desktop using VirtualBox (freely available). The VM provides a functional Kafka and Python environment to continue learning after the class (in addition to Hadoop, HBase, Hive, and Spark).
Further information on the class, access to the class notes, and the Linux Hadoop Minimal VM can be found here.
Recommended Preparation
- Watch: Linux Command Line Complete Video Course by Susan Lauber
Recommended Follow-up
- Read: Kafka: The Definitive Guide, 2nd Edition, by Shapira et al.
- Attend: Apache Hadoop, Spark, and Kafka Foundations: Effective Data Pipelines by Doug Eadline
- Watch: Kafka Essentials LiveLessons: A Quick-Start for Building Effective Data Pipelines by Douglas Eadline
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Total workshop time is 4 hours. There will be time for questions between segments. Emphasis will be placed on making sure all questions are addressed.
Segment 1: Introduction and Course Goals (15 mins)
- Class Resources and web page
- How to get the most out of this course
- Required prerequisite skills
- Using the Linux Hadoop Minimal virtual machine
Segment 2: Why Do I Need a Message Broker? (20 mins)
- Managing data growth
- Decoupling acquisition from use
- Reliability and scalability
- Kafka use cases
Segment 3: Kafka Components (20 mins)
- Producers and consumers
- Brokers, partitions, and clusters
- Questions and Answers (5 mins)
Break (10 mins)
Segment 4: Basic Examples (35 mins)
- Sending messages with producers
- Reading messages with consumers
- Questions and Answers (5 mins)
Segment 5: Using a Kafka UI (25 mins)
- Using KafkaEsque features
- Replaying the basic examples with KafkaEsque
- Questions and Answers (5 mins)
Segment 6: Example One: Streaming Weather Data (35 mins)
- Component Background: Kafka, Python, noaa-sdk
- Using the NOAA data source with Python
- Python Producer (NOAA data acquisition; sketched below)
- Python Consumer (data storage and analysis)
- Real-time demonstration
- Questions and Answers (5 mins)
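As a preview of this segment, the following is a hypothetical sketch of a weather producer. It is not the course's actual code; it assumes the noaa-sdk and kafka-python packages, a broker at localhost:9092, and an illustrative topic name and US postal code.

    # Hypothetical sketch of a weather producer, not the course's actual code.
    # Assumes the noaa-sdk and kafka-python packages and a broker at localhost:9092.
    import json

    from kafka import KafkaProducer
    from noaa_sdk import NOAA

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Pull recent observations for an illustrative US postal code and publish
    # each one to a "weather" topic for a downstream consumer to store or analyze.
    noaa = NOAA()
    for observation in noaa.get_observations("20001", "US"):
        producer.send("weather", observation)
    producer.flush()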
Break (10 mins)
Segment 7: Example Two: Image Streaming with Kafka (40 mins)
- Component Background: Kafka, Python, Bash
- Configuring Image Streaming to and from Kafka
- Python Producer (image capture; sketched below)
- Python Consumer (image analysis)
- Real-time demonstration
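As with the weather example, here is a hypothetical sketch of the image-producer side only; the file name, topic, and message-size setting are assumptions for illustration, not the course's actual configuration.

    # Hypothetical sketch: publishing a captured image file to Kafka.
    # Assumes kafka-python and a broker at localhost:9092; "capture.jpg" is a
    # placeholder for whatever image the capture step produces.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        max_request_size=5 * 1024 * 1024,  # allow messages larger than the 1 MB default
    )

    with open("capture.jpg", "rb") as f:
        producer.send("images", f.read())  # raw image bytes as the message value
    producer.flush()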
Segment 8: Course Wrap-up, Questions, and Additional Resources (10 mins)
Your Instructor
Douglas Eadline
Douglas Eadline began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and scalable data analytics (Hadoop/Spark) computing. Doug is currently editor of the ClusterMonkey.net website and was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include Hadoop and Spark Fundamentals LiveLessons (Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (coauthor, Addison-Wesley).