Chapter 41. Know Your Latencies
Dhruba Borthakur
Every data system has three characteristics that uniquely identify it: the size of the data, the recency of the data, and the latency of queries on that data. You are probably familiar with the first one, but the other two are sometimes an afterthought.
As a data engineer, I have frequently deployed a big data system for one use case, only to have a new user adopt the same system for a different use case and complain, “Oh, my query latencies are higher than my acceptable limit of 500 milliseconds” or “My query is not finding data records that were produced in the most recent 10 seconds.”
At the very outset of engineering a data system, the three things that I ask myself are as follows:
- What is my data latency?
- The data latency can vary widely. An annual budgeting system would be satisfied if it had access to all of last month’s data and earlier. Similarly, a daily reporting system will probably be happy if it can get access to the most recent 24 hours of data. An online gaming leaderboard application, by contrast, needs to analyze data produced as recently as 1 second ago.
- What is my query latency?
- If I am building a daily reporting system, I can afford to build a system that is optimized for overall throughput. A query could take a few minutes or even a few hours, because I need to produce ...
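The first two questions can be made concrete with a small measurement sketch. The Python below is illustrative only: the function names, the `REQUIREMENTS` table, and the use-case keys are hypothetical, and the thresholds are taken loosely from the examples in the text (24 hours for a daily report, 1 second for a leaderboard). It treats data latency as the age of the freshest record a query can see, and query latency as the wall-clock time a query takes to return.

```python
import time
from datetime import datetime, timedelta, timezone

# Hypothetical freshness requirements, loosely based on the examples above.
REQUIREMENTS = {
    "daily_report": timedelta(hours=24),
    "leaderboard": timedelta(seconds=1),
}


def data_latency(newest_visible_event: datetime, now: datetime) -> timedelta:
    """Age of the freshest record that a query can currently see."""
    return now - newest_visible_event


def meets_freshness(use_case: str, newest_visible_event: datetime, now: datetime) -> bool:
    """Check whether the data is fresh enough for the given (assumed) use case."""
    return data_latency(newest_visible_event, now) <= REQUIREMENTS[use_case]


def timed_query(run_query):
    """Run a query callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run_query()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    newest = now - timedelta(seconds=10)  # freshest record is 10 seconds old
    print("daily_report fresh enough:", meets_freshness("daily_report", newest, now))
    print("leaderboard fresh enough:", meets_freshness("leaderboard", newest, now))

    # Stand-in for a real query; any zero-argument callable works here.
    result, seconds = timed_query(lambda: sum(range(1_000_000)))
    print(f"query took {seconds * 1000:.1f} ms")
```

With data that is 10 seconds old, the same system passes the daily-report requirement and fails the leaderboard one, which is the kind of mismatch described at the start of this chapter.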