Chapter 80. The Yin and Yang of Big Data Scalability

Paul Brebner

Modern big data technologies such as Apache Cassandra and Apache Kafka achieve massive scalability by running on clusters of many nodes (servers) and scaling horizontally. Horizontal scaling shares the workload across all the nodes in a cluster: the data is partitioned so that each node holds only a subset, which lets throughput grow simply by adding more nodes, and each partition is replicated on more than one node for reliability, availability, and durability.
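For example, a Kafka topic is split into partitions spread across the brokers, with each partition replicated on several brokers. The following is a minimal sketch of creating such a topic with Kafka's AdminClient; the topic name, partition count, replication factor, and broker address are illustrative assumptions, not settings from the demonstration application.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // 60 partitions let the work spread across many brokers and consumers;
            // replication factor 3 keeps each partition on three brokers.
            NewTopic events = new NewTopic("anomaly-events", 60, (short) 3);
            admin.createTopics(Collections.singleton(events)).all().get();
        }

        // The same idea in Cassandra: the table's partition key spreads rows
        // across nodes, and the keyspace's replication factor keeps copies on
        // several nodes, e.g.
        //   CREATE KEYSPACE anomaly WITH replication =
        //     {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    }
}
```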

Because they are intrinsically scalable, Cassandra and Kafka are popular open source choices for low-latency, high-throughput, high-data-volume applications that need to scale out easily. We recently designed, tested, and scaled a demonstration anomaly detection application, with Cassandra as the storage layer, Kafka as the streaming layer, and Kubernetes for application scaling. The following figure shows the anomaly detection pipeline’s application design, clusters, and “knobs.”
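As a rough complement to the figure, the sketch below shows the shape of such a pipeline: consume events from Kafka, look up recent history for the same key in Cassandra, run a detector, and write the new observation back. The topic name, table schema, and simple threshold check are assumptions for illustration; the real application uses its own detector and schema.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class AnomalyDetectorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("group.id", "anomaly-detector");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             CqlSession cassandra = CqlSession.builder().build()) { // localhost by default
            consumer.subscribe(Collections.singleton("anomaly-events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    double value = Double.parseDouble(rec.value());
                    // Assumed table: events(id text, ts timestamp, value double,
                    //                       PRIMARY KEY (id, ts))
                    List<Double> history = new ArrayList<>();
                    for (Row row : cassandra.execute(
                            "SELECT value FROM anomaly.events WHERE id = ? LIMIT 50", rec.key())) {
                        history.add(row.getDouble("value"));
                    }
                    if (isAnomaly(value, history)) {
                        System.out.println("Anomaly for key " + rec.key() + ": " + value);
                    }
                    // Store the new observation so future checks can see it.
                    cassandra.execute(
                        "INSERT INTO anomaly.events (id, ts, value) VALUES (?, toTimestamp(now()), ?)",
                        rec.key(), value);
                }
            }
        }
    }

    // Simplified stand-in for the detector: flag values more than three
    // standard deviations from the mean of the recent history.
    static boolean isAnomaly(double value, List<Double> history) {
        if (history.size() < 10) return false;
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = history.stream().mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
        return Math.abs(value - mean) > 3 * Math.sqrt(var);
    }
}
```

Each detector instance runs as a Kubernetes pod, so adding pods scales the application tier while the Kafka and Cassandra clusters scale by adding brokers and nodes.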

Scaling is also hard! To get close to linear scalability as nodes are added, we had to tune multiple software resources so that the hardware resources could be used efficiently. The untuned system achieved 7 billion checks per day; tuning yielded a 2.5 times improvement.
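To make that concrete, the sketch below shows the kind of software resources that typically need sizing in a pipeline like this: application thread pools, database driver connection pools, and consumer batch settings. The option names are real, but the values and the choice of which knobs matter most are assumptions for illustration, not the tuning from this application.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class TuningKnobsSketch {
    public static void main(String[] args) {
        // Application concurrency: worker threads per pipeline instance (pod).
        ExecutorService workers = Executors.newFixedThreadPool(100);

        // Cassandra driver: connections per node and in-flight requests per connection.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 4)
                .withInt(DefaultDriverOption.CONNECTION_MAX_REQUESTS, 2048)
                .build();

        // Kafka consumer: batch sizes that trade a little latency for throughput.
        Properties consumerProps = new Properties();
        consumerProps.put("max.poll.records", "2000");
        consumerProps.put("fetch.min.bytes", "65536");

        try (CqlSession session = CqlSession.builder().withConfigLoader(loader).build()) {
            // ... run the pipeline using workers, session, and consumerProps ...
        } finally {
            workers.shutdown();
        }
    }
}
```

If any one of these pools or batch sizes is too small, the CPUs on the cluster nodes sit idle and adding hardware buys little; that is why the software knobs have to be tuned together with the cluster sizes.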

The tuning “knobs” control hardware (cluster ...
