Chapter 81. Threading and Concurrency in Data Processing

Matthew Housley, PhD

In a typical modern environment, data flows through complex distributed systems at each stage of a data pipeline, and servers must reliably maintain numerous connections to their peers. The Amazon Kinesis outage in late 2020 illustrates the consequences of ignoring concurrency problems.

The outage, which began on November 25, took down a large swath of the internet. In this discussion, I'll assume that you've read Amazon's official report. The report describes several engineering blunders, but I'll focus primarily on the limits of thread-based concurrency.

Operating System Threading

From the report: "Each frontend server creates operating system threads for each of the other servers in the frontend fleet."

In Linux and most other modern operating systems, threads are a mechanism that lets a CPU execute far more concurrent tasks than it has cores. Each thread gets access to a CPU core for an allotted time slice; when the slice expires, the thread's state is saved to system memory and the next thread's state is loaded into the core. This rapid cycling lets the operating system maintain the illusion that all threads are running simultaneously. It also makes threads an easy mechanism for managing many network connections: a thread blocked waiting on the network simply yields its core to one that has work to do.
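To make the thread-per-connection pattern concrete, here is a minimal sketch in Python. The port number and the echo loop are placeholders of my own choosing, not details of the Kinesis frontend; the point is only that each accepted connection gets its own OS thread, which can block on I/O without stalling the others:

    import socket
    import threading

    def handle_peer(conn: socket.socket) -> None:
        # Runs on its own OS thread; blocking on recv() here does not
        # stall the threads serving other peers.
        with conn:
            while data := conn.recv(4096):
                conn.sendall(data)  # echo back, standing in for real work

    def serve(port: int = 9000) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("0.0.0.0", port))
            srv.listen()
            while True:
                conn, _addr = srv.accept()
                # One OS thread per connection: easy to reason about, but
                # the thread count grows with the number of peers.
                threading.Thread(target=handle_peer, args=(conn,), daemon=True).start()

    if __name__ == "__main__":
        serve()

The appeal of this design is that blocking I/O stays simple and the scheduler handles the interleaving. The cost is that every connection now carries a thread's worth of overhead, which is where the trouble starts.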

Threading Overhead

From the report: "The new capacity had caused all of the servers in the fleet to ..."
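Because each frontend server holds one thread per peer, per-server thread counts grow linearly with fleet size, and fleet-wide counts grow quadratically. A back-of-the-envelope sketch in Python (the fleet sizes here are hypothetical, and the Linux limits mentioned in the comments are standard kernel knobs, not figures from the report):

    # With one OS thread per peer, as in the report's description,
    # every server added to the fleet raises the thread count on
    # every existing server.
    def peer_threads_per_server(fleet_size: int) -> int:
        return fleet_size - 1  # one thread for each other server

    for n in (500, 1000, 2000):  # hypothetical fleet sizes
        print(f"fleet of {n}: {peer_threads_per_server(n)} peer threads per server, "
              f"{n * (n - 1)} threads fleet-wide")

    # Linux caps thread counts system-wide (see /proc/sys/kernel/threads-max)
    # and per user (ulimit -u), and each thread typically reserves several
    # megabytes of stack address space by default, so these limits bite
    # well before the CPU runs out of cycles.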
