Collecting lots of DevOps data from an infrastructure and environment is mostly a good thing. Historically, unused data was considered to have a negative ROI. Those big printouts that no one read cost money to produce! But today, we have modern logging stacks like Elasticsearch, Fluentd and Kibana (EFK) and machine-learning algorithms, so capturing all the data routinely pays off in both root cause analysis and predictive analytics.
Not all DevOps data is equal, however, and not every deviation from normal operations should be taken as a reason to set off a cacophony of alarms and data alerts.
I have four rules for alarms:
1. An alarm should be exciting.
Alarms for events that are actually part of an expected routine lead to alarm fatigue. Think of a fire alarm that goes off daily for random reasons. Is anyone going to take it seriously when it goes off because there's an actual fire?
2. A DevOps data alert should indicate that something needs to be fixed. Now.
I'd like to highlight a subtle distinction. I didn't say that something bad happened. I didn't say that there was a worrisome trend. I said that something needs to be immediately fixed.
Consider this scenario: A major software service in your environment goes down overnight, but it automatically fails over to a backup, successfully restarts and stays up. That sounds like a discussion topic at your team's morning meeting. Root cause analysis is almost certainly in order. But there probably isn't a good reason to wake up an engineer in the middle of the night to inform them that there was a problem and their properly designed recovery system worked as planned.
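To make the distinction concrete, here is a minimal Python sketch of that decision. The `ServiceEvent` fields, the `classify` function and the three outcome labels are all hypothetical names chosen for illustration; a real monitoring stack would expose this information in its own way.

```python
from dataclasses import dataclass


@dataclass
class ServiceEvent:
    """Hypothetical summary of an incident as a monitoring system might report it."""
    service: str
    primary_failed: bool           # the main instance went down
    recovered_automatically: bool  # failover or restart worked and the service stayed up


def classify(event: ServiceEvent) -> str:
    """Return 'page', 'morning-review' or 'log-only' for an event."""
    if event.primary_failed and not event.recovered_automatically:
        return "page"            # something needs to be fixed now
    if event.primary_failed:
        return "morning-review"  # worth a root cause analysis, not a 3 a.m. call
    return "log-only"            # nothing went wrong; keep the data for later analysis


# The overnight scenario described above: a failure, then a clean recovery.
overnight = ServiceEvent("billing-api", primary_failed=True, recovered_automatically=True)
print(classify(overnight))  # morning-review
```

The point of the sketch is that "something bad happened" and "someone must act now" are separate questions, and only the second one triggers a page.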
3. No ambers. An alarm should only be red.
To be clear, we use gradation in severity in many places. There are critical security patches and not-so-critical security patches. We might be almost out of some computing resource, or we might just be starting to run low.
It's fine to use amber as a "keep your eye on this" sort of thing in dashboards and other indicators. But not for alarms. Either it's something that needs to be fixed now or it isn't. If the data alert means that some action needs to be taken, it might as well be red. No one likes an indecisive alarm.
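One way to picture this split is to let the dashboard keep three colors while the alarm stays a plain yes or no. The sketch below uses disk usage as the example resource; the 80% and 95% thresholds and both function names are assumptions made up for illustration, not recommended values.

```python
AMBER_AT = 0.80  # assumed threshold: "starting to run low"
RED_AT = 0.95    # assumed threshold: "almost out"


def dashboard_color(used_fraction: float) -> str:
    """Dashboards may show gradations, including amber."""
    if used_fraction >= RED_AT:
        return "red"
    if used_fraction >= AMBER_AT:
        return "amber"
    return "green"


def should_alarm(used_fraction: float) -> bool:
    """Alarms are binary: either someone must act now or they must not."""
    return used_fraction >= RED_AT


for used in (0.50, 0.85, 0.97):
    print(f"{used:.0%} used: dashboard={dashboard_color(used)}, alarm={should_alarm(used)}")
```

Note that the amber state never reaches `should_alarm`. It exists only for people who are already looking at the dashboard.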
4. An alarm must reach the right people.
If an alarm means that something needs to be immediately fixed, it stands to reason that the person getting the alarm should be in a position to do the fixing. (Or at least to figure out who can and contact them.)
A corollary is that the person (or people, for redundancy) receiving the alert should understand that they have the responsibility to respond. I'm sure everyone is familiar with email blasts in which all the recipients think they're on the list for informational purposes and no one takes whatever action is being requested.
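The routing and responsibility ideas above can be sketched in a few lines of Python. Everything here is hypothetical: the `ON_CALL` table, the names in it, and the `Alarm`, `next_to_notify` and `acknowledge` helpers are invented for illustration and do not correspond to any particular paging product.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical on-call table: each service lists a primary and a backup responder.
ON_CALL = {
    "billing-api": ["alice", "bob"],
    "search": ["carol", "dave"],
}


@dataclass
class Alarm:
    service: str
    summary: str
    acknowledged_by: Optional[str] = None  # set once someone takes responsibility


def responders_for(alarm: Alarm) -> List[str]:
    """Return the people who can actually fix this service, in escalation order."""
    owners = ON_CALL.get(alarm.service)
    if not owners:
        raise LookupError(f"no on-call owner configured for {alarm.service!r}")
    return owners


def next_to_notify(alarm: Alarm, attempt: int) -> Optional[str]:
    """Pick who to notify on a given attempt; stop once the alarm is acknowledged."""
    if alarm.acknowledged_by is not None:
        return None
    owners = responders_for(alarm)
    return owners[min(attempt, len(owners) - 1)]


def acknowledge(alarm: Alarm, person: str) -> None:
    """Record that a named responder has accepted responsibility for the alarm."""
    if person not in responders_for(alarm):
        raise ValueError(f"{person} is not on call for {alarm.service}")
    alarm.acknowledged_by = person


alarm = Alarm("billing-api", "database unreachable and failover did not recover")
print(next_to_notify(alarm, attempt=0))  # alice
print(next_to_notify(alarm, attempt=1))  # bob
acknowledge(alarm, "bob")
print(next_to_notify(alarm, attempt=2))  # None
```

The explicit acknowledgment step is the design choice that matters. Unlike an email blast, the alarm keeps escalating until one named person has accepted responsibility for it.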
Healthcare has reached a similar diagnosis
Alarm fatigue is a genuine problem in healthcare. For example, the American Association of Critical-Care Nurses found that 80% to 99% of ECG monitor alarms are false or clinically insignificant. A 2011 Boston Globe investigation identified "more than 200 hospital patients nationwide whose deaths between January 2005 and June 2010 were linked to problems with alarms on patient monitors that track heart function, breathing, and other vital signs."
In short, not all of that data should be treated as a metric for overall success, and data alerts are for the unusual and the urgent. By all means, collect the data and implement dashboards that display a broad range of it. But think about your data alerts carefully and systematically.