Big, fast, easy data with KSQL
A look at the new streaming SQL engine for Apache Kafka.
Modern businesses have data at their core, and this data is changing continuously, at a rapid pace and in ever-increasing volumes. Stream processing allows businesses to harness this torrent of information in real time, and tens of thousands of companies like Netflix, Uber, Airbnb, PayPal, and The New York Times use Apache Kafka as their streaming platform of choice to reshape their industries. Whether you are booking a hotel or a flight, taking a cab, playing a video game, reading a newspaper, shopping online, or wiring money, chances are that Kafka is powering the activity behind the scenes.
However, the world of stream processing still has a very high barrier to entry. Today’s most popular stream processing technologies, including Apache Kafka’s Streams API, still require the user to write code in programming languages such as Java or Scala. This hard requirement on coding skills prevents many companies from unlocking the full benefits of stream processing. Thankfully, there is now a better way.
Enter KSQL for Kafka
The recently introduced KSQL, the streaming SQL engine for Apache Kafka, substantially lowers the barrier to entry to the world of stream processing. Instead of writing a lot of programming code, all you need to get started with stream processing is a simple SQL statement, such as:
SELECT * FROM payments-kafka-stream WHERE fraud_probability > 0.8;
That’s it! And while this might not be immediately obvious, the KSQL streaming query above is distributed, scalable, elastic, and real time, to meet the data needs of businesses today.
Of course, you can do much more with KSQL than I have shown in the simple example above. KSQL is open source (Apache 2.0 licensed) and built on top of Kafka’s Streams API. This means it supports a wide range of powerful stream processing operations, including filtering, transformations, aggregations, joins, windowing, and sessionization. This way you can detect anomalies and fraudulent activities in real time, monitor infrastructure and IoT devices, conduct session-based analysis of user activities, perform real-time ETL, and much more.
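To pick one of those operations: sessionization groups a user’s events into sessions separated by gaps of inactivity. A minimal sketch of what this looks like in KSQL, assuming a website_clickstream stream with a user_id column (the stream name, column name, and 30-minute gap are all illustrative):

-- Count clicks per user session, where a session closes after
-- 30 minutes of inactivity (the gap length is an assumption)
CREATE TABLE clicks_per_session AS
  SELECT user_id, COUNT(*)
  FROM website_clickstream
  WINDOW SESSION (30 MINUTES)
  GROUP BY user_id;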
Now you may wonder, what can I use KSQL for? Here are some further examples of what you can easily do with it.
Real-time monitoring and real-time analytics
One use of KSQL is defining custom business-level metrics that are computed in real time and that you can monitor and alert on. For example, tracking the number of concurrent online players for a triple-A video game franchise (“Are our players engaged? Has the latest game expansion increased playtime?”), or reporting on the number of abandoned shopping carts for an e-commerce website (“Has the latest update to our online store made it easier for customers to check out?”). Another use is to define a notion of correctness for your business applications in KSQL and then check that they meet it as they run in production.
KSQL makes it straightforward to define appropriate metrics from a stream of raw events, whether these are generated by database updates, applications, mobile devices, or any other source:
CREATE TABLE possibly_failing_vehicles AS
  SELECT vehicle, COUNT(*)
  FROM vehicle_monitoring_stream
  WINDOW TUMBLING (SIZE 5 MINUTES)
  WHERE event_type = 'ERROR'
  GROUP BY vehicle
  HAVING COUNT(*) > 2;
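The business-level metrics mentioned earlier follow the same pattern. For instance, a rough sketch of the concurrent-players metric might look like the following, assuming a stream of periodic heartbeat events sent by game clients (all names here are hypothetical):

-- Approximate concurrent players per game title, updated every minute,
-- assuming each connected client sends one heartbeat per minute
CREATE TABLE concurrent_players AS
  SELECT game_title, COUNT(*)
  FROM player_heartbeats
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY game_title;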
Online data integration and enrichment
Most of the data processing done in companies falls in the domain of data enrichment: take data coming out of several databases, transform it, join it together, and store it into a key-value store, search index, cache, or other data-serving system. KSQL, when used with Kafka connectors for systems like Oracle, MySQL, Elasticsearch, HDFS, or S3, enables a move from batch data integration to real-time data integration.
As shown in the KSQL query below, you can enrich streams of data with metadata stored in tables using stream-table joins, or do simple filtering of personally identifiable information (PII) before loading the stream into another system.
CREATE STREAM vip_users AS
  SELECT user_id, user_country, web_page, action
  FROM website_clickstream c
  LEFT JOIN users u ON u.user_id = c.user_id
  WHERE u.level = 'Platinum';
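The PII-filtering case mentioned above is even simpler: a projection that leaves the sensitive columns behind. A minimal sketch, assuming the clickstream also carries PII columns (such as an email address) that are simply not selected into the derived stream (column names are illustrative):

-- Derive a PII-free stream by selecting only non-sensitive columns
CREATE STREAM website_clickstream_no_pii AS
  SELECT user_id, web_page, action
  FROM website_clickstream;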
Security and anomaly detection
Security use cases often look similar to monitoring and analytics: rather than monitoring application behavior or business behavior, here you’re looking for patterns of fraud, abuse, spam, intrusion, or other bad behavior. KSQL queries can transform event streams into numerical time series aggregates that are pumped into systems such as Elasticsearch via the Kafka Connect Elasticsearch connector and then visualized in a real-time dashboard such as Grafana.
KSQL provides a simple yet powerful way to define such patterns and query live streams in real time:
CREATE TABLE possible_fraud AS
  SELECT card_number, COUNT(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING COUNT(*) > 3;
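Because possible_fraud is itself a continuously updated table, you can also inspect it interactively; a continuous query like the following streams every change to the table as it happens, until you cancel it:

SELECT * FROM possible_fraud;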
Of streams and databases
Of course, there are many more use cases for KSQL than what I can show in this short article, such as monitoring a fleet of vehicles (“Will a truck need predictive maintenance in the next few days?”) or distributed IoT devices and home automation sensors (“Why is there a surge in temperature on the 2nd floor?”), or analyzing database updates in Oracle in real time. Some creative users are even using KSQL to analyze car racing telemetry data in real time.
But, let’s take a step back from these concrete examples for a moment. In my opinion, what’s even more exciting is that KSQL brings together the worlds of streams (Kafka) and databases (Oracle, MySQL, and friends) by turning the database inside out. In KSQL, similar to Kafka’s Streams API, there are two core data abstractions: the STREAM and the TABLE. They allow you to work with your data in a stream or in a table format. This is important because, in practice, almost every real-time use case you want to implement requires both streams and tables.
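To make the two abstractions concrete, here is a minimal sketch of registering existing Kafka topics as a STREAM and as a TABLE; the topic names, columns, and formats are assumptions for illustration (the users table is how the table joined in the earlier example might have been declared):

-- A stream: an unbounded, append-only sequence of immutable events
CREATE STREAM payments (payment_id VARCHAR, amount DOUBLE)
  WITH (kafka_topic = 'payments', value_format = 'JSON');

-- A table: the latest value per key, continuously updated as new
-- records arrive (the key property names the key column)
CREATE TABLE users (user_id VARCHAR, level VARCHAR)
  WITH (kafka_topic = 'users', value_format = 'JSON', key = 'user_id');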
Here’s a slightly more involved example: as a retailer, you could use KSQL to aggregate a real-time stream of customer activity events in Kafka (purchases, geolocation updates, etc.) into a continuously updated table of 360-degree customer profiles, joined with other internal and external information about those customers. This consolidated profile table can then power applications such as detecting fraudulent payments in a stream of financial transactions, via KSQL or Kafka’s Streams API. Its data can also be streamed in real time, via Kafka’s Connect framework and ready-to-use connectors, into traditional RDBMS such as Oracle, PostgreSQL, or MySQL that are part of your existing infrastructure. And all of this is real time, fault tolerant, and done at scale, thanks to the strong technical foundation of Apache Kafka, the distributed streaming platform.
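A sketch of the aggregation step in that scenario could be as simple as the following, with all stream and column names invented for illustration:

-- Roll a stream of customer activity events up into a continuously
-- updated per-customer profile table
CREATE TABLE customer_profiles AS
  SELECT user_id,
         COUNT(*) AS activity_events,
         SUM(purchase_amount) AS total_spend
  FROM customer_activity
  GROUP BY user_id;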
Where to go from here
I encourage you to take KSQL for a spin yourself! It will take only a few minutes to get up and running with your first use case.