Big, fast, easy data with KSQL
A look at the new streaming SQL engine for Apache Kafka.
Modern businesses have data at their core, and this data is changing continuously, at a rapid pace and in ever-increasing volumes. Stream processing allows businesses to harness this torrent of information in real time, and tens of thousands of companies like Netflix, Uber, Airbnb, PayPal, and The New York Times use Apache Kafka as their streaming platform of choice to reshape their industries. Whether you are booking a hotel or a flight, taking a cab, playing a video game, reading a newspaper, shopping online, or wiring money, chances are that Kafka is powering the activity behind the scenes.
However, the world of stream processing still has a very high barrier to entry. Today’s most popular stream processing technologies, including Apache Kafka’s Streams API, still require the user to write code in programming languages such as Java or Scala. This hard requirement on coding skills prevents many companies from unlocking the full benefits of stream processing. Thankfully, there is now a better way.
Enter KSQL for Kafka
The recently introduced KSQL, the streaming SQL engine for Apache Kafka, substantially lowers the barrier to entry to the world of stream processing. Instead of writing a lot of programming code, all you need to get started with stream processing is a simple SQL statement, such as:
SELECT * FROM payments-kafka-stream WHERE fraud_probability > 0.8;
That’s it! And while this might not be immediately obvious, the KSQL streaming query above is distributed, scalable, elastic, and real time, to meet the data needs of businesses today.
Of course, you can do much more with KSQL than I have shown in the simple example above. KSQL is open source (Apache 2.0 licensed) and built on top of Kafka’s Streams API. This means it supports a wide range of powerful stream processing operations, including filtering, transformations, aggregations, joins, windowing, and sessionization. This way you can detect anomalies and fraudulent activities in real time, monitor infrastructure and IoT devices, conduct session-based analysis of user activities, perform real-time ETL, and much more.
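To pick one of those operations: sessionization groups a user’s events into sessions separated by gaps of inactivity. A minimal sketch of what this looks like in KSQL, assuming a website_clickstream stream with a user_id column (the stream name, column name, and 30-minute gap are all illustrative):

-- Count clicks per user session, where a session closes after
-- 30 minutes of inactivity (the gap length is an assumption)
CREATE TABLE clicks_per_session AS
  SELECT user_id, COUNT(*)
  FROM website_clickstream
  WINDOW SESSION (30 MINUTES)
  GROUP BY user_id;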
Now you may wonder, what can I use KSQL for? Here are some further examples of what you can easily do with it.
Real-time monitoring and real-time analytics
One use of KSQL is defining custom business-level metrics that are computed in real time and that you can monitor and alert on. For example, tracking the number of concurrent online players for a triple-A video game franchise (“Are our players engaged? Has the latest game expansion increased playtime?”), or reporting on the number of abandoned shopping carts for an e-commerce website (“Has the latest update to our online store made it easier for customers to check out?”). Another use is to define a notion of correctness for your business applications in KSQL and then check that they meet it as they run in production.
KSQL makes it straightforward to define appropriate metrics from a stream of raw events, whether these are generated by database updates, applications, mobile devices, or any other source:
CREATE TABLE possibly_failing_vehicles AS
  SELECT vehicle, COUNT(*)
  FROM vehicle_monitoring_stream
  WINDOW TUMBLING (SIZE 5 MINUTES)
  WHERE event_type = 'ERROR'
  GROUP BY vehicle
  HAVING COUNT(*) > 2;
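The business-level metrics mentioned earlier follow the same pattern. For instance, a rough sketch of the concurrent-players metric might look like the following, assuming a stream of periodic heartbeat events sent by game clients (all names here are hypothetical):

-- Approximate concurrent players per game title, updated every minute,
-- assuming each connected client sends one heartbeat per minute
CREATE TABLE concurrent_players AS
  SELECT game_title, COUNT(*)
  FROM player_heartbeats
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY game_title;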
Online data integration and enrichment
Most of the data processing done in companies falls in the domain of data enrichment: take data coming out of several databases, transform it, join it together, and store it into a key-value store, search index, cache, or other data-serving system. KSQL, when used with Kafka connectors for systems like Oracle, MySQL, Elasticsearch, HDFS, or S3, enables a move from batch data integration to real-time data integration.
As shown in the KSQL query below, you can enrich streams of data with metadata stored in tables using stream-table joins, or do simple filtering of personally identifiable information (PII) before loading the stream into another system.
CREATE STREAM vip_users AS
  SELECT user_id, user_country, web_page, action
  FROM website_clickstream c
  LEFT JOIN users u ON u.user_id = c.user_id
  WHERE u.level = 'Platinum';
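The PII-filtering case mentioned above is even simpler: a projection that leaves the sensitive columns behind. A minimal sketch, assuming the clickstream also carries PII columns (such as an email address) that are simply not selected into the derived stream (column names are illustrative):

-- Derive a PII-free stream by selecting only non-sensitive columns
CREATE STREAM website_clickstream_no_pii AS
  SELECT user_id, web_page, action
  FROM website_clickstream;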
Security and anomaly detection
Security use cases often look similar to monitoring and analytics: rather than monitoring application behavior or business behavior, here you’re looking for patterns of fraud, abuse, spam, intrusion, or other bad behavior. KSQL queries can transform event streams into numerical time series aggregates that are pumped into systems such as Elasticsearch via the Kafka Connect Elasticsearch connector and then visualized in a real-time dashboard such as Grafana.
KSQL provides a simple yet powerful way to define such patterns and query live streams in real time:
CREATE TABLE possible_fraud AS
  SELECT card_number, COUNT(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING COUNT(*) > 3;
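Because possible_fraud is itself a continuously updated table, you can also inspect it interactively; a continuous query like the following streams every change to the table as it happens, until you cancel it:

SELECT * FROM possible_fraud;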
Of streams and databases
Of course, there are many more use cases for KSQL than what I can show in this short article, such as monitoring a fleet of vehicles (“Will a truck need predictive maintenance in the next few days?”) or distributed IoT devices and home automation sensors (“Why is there a surge in temperature on the 2nd floor?”), or analyzing database updates in Oracle in real time. Some creative users are even using KSQL to analyze car racing telemetry data in real time.
But, let’s take a step back from these concrete examples for a moment. In my opinion, what’s even more exciting is that KSQL brings together the worlds of streams (Kafka) and databases (Oracle, MySQL, and friends) by turning the database inside out. In KSQL, similar to Kafka’s Streams API, there are two core data abstractions: the STREAM and the TABLE. They allow you to work with your data in a stream or in a table format. This is important because, in practice, almost every real-time use case you want to implement requires both streams and tables.
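To make the two abstractions concrete, here is a minimal sketch of registering existing Kafka topics as a STREAM and as a TABLE; the topic names, columns, and formats are assumptions for illustration (the users table is how the table joined in the earlier example might have been declared):

-- A stream: an unbounded, append-only sequence of immutable events
CREATE STREAM payments (payment_id VARCHAR, amount DOUBLE)
  WITH (kafka_topic = 'payments', value_format = 'JSON');

-- A table: the latest value per key, continuously updated as new
-- records arrive (the key property names the key column)
CREATE TABLE users (user_id VARCHAR, level VARCHAR)
  WITH (kafka_topic = 'users', value_format = 'JSON', key = 'user_id');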
Here’s a slightly more involved example: as a retailer, you could use KSQL to aggregate a real-time stream of customer activity events in Kafka (purchases, geolocation updates, etc.) into a continuously updated table of 360-degree customer profiles, joined with other internal and external information about those customers. This consolidated profile table can then power applications such as detecting fraudulent payments in a stream of financial transactions, via KSQL or Kafka’s Streams API. Its data can also be streamed in real time, via Kafka’s Connect framework and ready-to-use connectors, into traditional RDBMS such as Oracle, PostgreSQL, or MySQL that are part of your existing infrastructure. And all of this is real time, fault tolerant, and done at scale, thanks to the strong technical foundation of Apache Kafka, the distributed streaming platform.
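A sketch of the aggregation step in that scenario could be as simple as the following, with all stream and column names invented for illustration:

-- Roll a stream of customer activity events up into a continuously
-- updated per-customer profile table
CREATE TABLE customer_profiles AS
  SELECT user_id,
         COUNT(*) AS activity_events,
         SUM(purchase_amount) AS total_spend
  FROM customer_activity
  GROUP BY user_id;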
Where to go from here
I encourage you to take KSQL for a spin yourself! It will take only a few minutes to get up and running with your first use case.