Nagios called only once last night.
The cellphone hidden under my pillow made its characteristic sound at 25 past midnight on New Year's Eve, waking me up after a few minutes of sleep. A virtual machine had run out of space after GitLab's automatic backup filled the entire disk. I had to get up and fix the issue before I could go back to bed.
Nagios called only once last night, and there’s a lot to learn from it.
The first thing is that Nagios shouldn't have called last night. I got dragged out of bed for an issue on a development machine, and that should never happen. If there's one thing I'm sure of about alerting, it's that it should never be triggered outside office hours unless something business critical is broken. A development machine is not business critical, and it could have waited a few hours.
It means being able to set priorities, even in production. A full /var/ on a compute server is not as critical as a database disk reaching 80%. A crashed frontend (behind a properly configured load balancer) is not as critical as a crashed master SQL server.
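In Nagios terms, this kind of prioritization can be expressed with per-service notification periods. A minimal sketch, assuming hypothetical host and timeperiod names ("dev-gitlab", "db-master", "workhours" would all be defined in your own configuration):

```
# Sketch: route non-critical checks to office hours only.
# "workhours" and "24x7" are timeperiods defined in timeperiods.cfg.
define service {
    host_name             dev-gitlab
    service_description   Disk /var
    check_command         check_disk!20%!10%!/var
    notification_period   workhours   ; dev box: never page at night
    contact_groups        admins
}

define service {
    host_name             db-master
    service_description   Disk /data
    check_command         check_disk!20%!10%!/data
    notification_period   24x7        ; business critical: page anytime
    contact_groups        oncall
}
```

With this split, the same check on two machines pages you at different times: the dev box waits for the morning, the database master does not.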
The second thing is about rethinking alerting. Thinking about your platform's alerting from a business point of view means your traditional Nagios setup is both too much and not enough. It's too much because 90% of the alerts you get can wait until the next morning, which, by the way, means you get too many of them. It's not enough because the usual Nagios checks won't tell you your service is broken, so you need to rely on application-specific probes such as Watchmouse or Pingdom.
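An application-specific probe doesn't have to come from a SaaS vendor; the key idea is checking the service the way a user would. Here is a minimal sketch following the standard Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the health endpoint URL would be your own and is purely illustrative:

```python
# Minimal application-level probe using the Nagios plugin
# exit-code convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def status_to_exit_code(http_status):
    """Map an HTTP status to a Nagios exit code."""
    if 200 <= http_status < 300:
        return OK
    if 300 <= http_status < 400:
        return WARNING   # unexpected redirect: worth a look, not a page
    return CRITICAL      # 4xx/5xx: the service is actually broken

def probe(url, timeout=5):
    """Hit a health endpoint and return a Nagios exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return status_to_exit_code(resp.status)
    except Exception:
        return CRITICAL  # connection refused, timeout, DNS failure...
```

Wrapped in a small script that calls `sys.exit(probe(...))`, this plugs into Nagios like any other check, but it answers the question that matters: is the service up, not just the host.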
Third, in 2015, alerting should only be triggered when your core system gets hurt. In other words, most of your system should self-heal.
I had a great experience hosting on Amazon Web Services because of one feature: the Auto Scaling group. Being able to scale groups of nodes up or down according to their load is awesome. Knowing a broken machine will replace itself is even better. It happened a couple of times, because a Python process crashed and the load balancer check failed, or because of a kernel panic. It was a relief to find the replacement message in my mailbox in the morning instead of being woken by an SMS to restart a frontend process. It means there's only one thing left to worry about: your stored data.
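The self-healing part boils down to one setting: letting the load balancer's health check, not just EC2's, decide whether an instance is alive. A sketch with the AWS CLI, where every name (group, launch configuration, load balancer, zones) is a placeholder for your own resources:

```shell
# Sketch: an Auto Scaling group that terminates and replaces any
# instance failing the load balancer's health check.
# All names below are placeholders.
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name frontend-asg \
    --launch-configuration-name frontend-lc \
    --min-size 2 --max-size 6 \
    --load-balancer-names frontend-elb \
    --health-check-type ELB \
    --health-check-grace-period 300 \
    --availability-zones eu-west-1a eu-west-1b
```

With `--health-check-type ELB`, a frontend whose process crashed fails the balancer check and gets replaced automatically, which is exactly the "replacement message in the mailbox" scenario above.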
Fourth, this was not the first time this alert was sent. This should not have happened either.
The monitoring sending an alert means something's wrong. It can be a software or hardware failure, a usage issue (like logs filling a hard drive), a scaling issue, or something else.
When it happens, you can fix the problem, or fix the situation to ensure the probe won't complain anymore. Deleting the oldest backups brings the probe back to green until it turns red again. Enforcing a shorter local retention fixes it for good. That difference is critical, both in terms of architecture and focus. Building for green probes means you can focus on something else, sleeping for example.
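"Fixing the situation" here can be as small as a cron job enforcing retention. A minimal sketch, assuming backups are plain files in a single directory (the path and keep count are illustrative, not from the original setup):

```python
# Sketch: keep only the N most recent local backups so the
# disk never fills up between off-site transfers.
import os

def prune_backups(backup_dir, keep=7):
    """Delete all but the `keep` most recent files in backup_dir.
    Returns the number of backups removed."""
    files = [os.path.join(backup_dir, f) for f in os.listdir(backup_dir)]
    files = [f for f in files if os.path.isfile(f)]
    files.sort(key=os.path.getmtime, reverse=True)  # newest first
    stale = files[keep:]
    for old in stale:
        os.remove(old)
    return len(stale)
```

Run nightly before the backup job, this turns "delete the oldest backups by hand at 00:25" into something the machine does for you, which is the whole point of building for green probes.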
Fifth, it was more than just a hard drive being full.
Why did the chicken cross the road? I don't know, but I do know why my hard drive filled up: a local backup filled it. Which means, by the way, that the backup could not complete properly. Which means the data are not backed up (well, they are backed up another way; this backup was redundant, but still). Which means there's a more global risk to think about.
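One way to defuse that global risk is to make the backup job refuse to start when it cannot fit, so a full disk produces a clean "backup skipped" failure instead of an outage plus a silently broken backup. A sketch, assuming you can estimate the backup's size in advance (the reserve ratio is an illustrative choice, not a recommendation):

```python
# Sketch: check headroom before writing a local backup, so the
# backup job fails cleanly instead of filling the disk.
import shutil

def enough_space(path, needed_bytes, reserve_ratio=0.10):
    """True if the filesystem holding `path` can absorb needed_bytes
    and still keep reserve_ratio of the disk free afterwards."""
    usage = shutil.disk_usage(path)
    reserve = int(usage.total * reserve_ratio)
    return usage.free - needed_bytes >= reserve
```

A backup script would call `enough_space("/var/backups", estimated_size)` first and bail out loudly, during office hours, if it returns False.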
That’s a lot to think about from a simple SMS received on New Year’s Eve, because getting an alert from the monitoring is something that should never happen. And if I were ever too lazy to actually fix what’s broken, I’m pretty sure my wife would remind me how much she hates being woken at night. That’s a better motivation than any uptime contest.