Chapter 3. Alerting
Some people believe that alerting is an art whose mastery takes long years of trial and error. Perhaps, but most of us can’t wait that long. I prefer to view alerting as an exact science based on logic and probability. It’s about balancing two conflicting objectives: sensitivity, or when to classify an anomaly as problematic, and specificity, or when it is safe to assume that no problem exists. These objectives pull your alerting configuration in opposite directions. Figuring out the right strategy is not a trivial task, but its effectiveness can be measured. The right choice depends on organizational priorities, the level of recovery built into the monitored system, and the expected impact when things go awry. At any rate, there is nothing supernatural about the process; getting it right is well within everyone’s reach.
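To make the trade-off concrete, consider the standard statistical definitions: sensitivity is the fraction of real problems that trigger an alert, TP / (TP + FN), and specificity is the fraction of healthy intervals that correctly stay quiet, TN / (TN + FP). The following sketch is my illustration rather than anything prescribed in this book, and all of the counts in it are hypothetical:

    # Illustrative sketch, not the book's own method: the standard
    # definitions of sensitivity and specificity applied to alert
    # outcomes. All counts below are hypothetical.

    def sensitivity(true_positives, false_negatives):
        # Fraction of real problems that actually triggered an alert.
        return true_positives / (true_positives + false_negatives)

    def specificity(true_negatives, false_positives):
        # Fraction of healthy intervals that correctly stayed quiet.
        return true_negatives / (true_negatives + false_positives)

    # A permissive threshold catches every incident but pages needlessly:
    print(sensitivity(true_positives=20, false_negatives=0))    # 1.00
    print(specificity(true_negatives=900, false_positives=80))  # ~0.92

    # A strict threshold stays quiet but misses real problems:
    print(sensitivity(true_positives=14, false_negatives=6))    # 0.70
    print(specificity(true_negatives=978, false_positives=2))   # ~0.998

Tightening a threshold raises specificity at the expense of sensitivity, and vice versa; computing both ratios over a history of incidents is one way to measure how effective a given alerting configuration really is.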
The Challenge
In my experience, it’s simply impossible to maintain focused attention on a timeseries in anticipation of a problem. The vast amount of information running through the system generates a great number of timeseries to watch. Hiring people solely for the purpose of watching performance graphs is not very cost-effective, and it wouldn’t be a very rewarding job either. Even if it were, though, I’m still not convinced that a human operator would be better at recognizing alertable patterns than a machine.
The process of alerting is full of unstable variables of a qualitative nature, and it presumes an element of responsibility. ...