Review: Effective Monitoring & Alerting

Effective Monitoring and Alerting by Slawek Ligus was not quite the book I thought I was picking up. I was looking for something more prescriptive, and what I read sits far higher up the stack. It wasn’t a bad book, but it was different from what I went in expecting.

The book takes a high-level look at how to keep alerting from getting out of hand. That is the overall message the author is trying to get across, and it breaks down like this:

  • You need to make sure you monitor the proper things in the proper way. This brings about a deep understanding of the system as a whole and also forces you to figure out what dependencies the distinct parts of your system have, so you can be certain you are monitoring things that matter.
  • Armed with that information, you move on to mapping out what should be firing alerts. This is where the dependency data comes in: we want to alert only on the parts of the system that are actually failing, not on the parts that depend on the failed area.
  • The entire idea is to make sure that the alerts getting sent out are needed and useful. There is talk of standardizing the names of the systems and alerts so you can know exactly what is happening right from the start.
  • There is a huge focus on making sure the alerts are truly actionable and needed so that you don’t give your IT operations staff alert fatigue. The idea is to alert on things that can and need to be fixed and on nothing else.
  • This means monitoring everything but alerting on just a small subset. You can use the monitoring data for capacity planning and for spotting issues before they start, but you will constantly be tuning the alerting thresholds so that only the most important alerts are sent through.
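
To make the dependency point concrete, here is a minimal sketch of my own (it is not taken from the book) of how you might alert only on the root of a failure. The function name root_cause_alerts, the depends_on map, and the component names are all hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from typing import Dict, List, Set


def root_cause_alerts(depends_on: Dict[str, List[str]], failing: Set[str]) -> Set[str]:
    """Return the failing components that have no failing dependency.

    depends_on maps a component to the components it directly relies on.
    Dependencies are followed transitively, so a component is suppressed
    even if the failure sits several hops below it.
    Assumption: the dependency graph is acyclic.
    """

    def has_failing_dependency(component: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(component, []):
            if dep in seen:
                continue  # already visited on this walk
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return {c for c in failing if not has_failing_dependency(c, {c})}


# Hypothetical system: web -> api -> (db, cache)
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}
failing = {"web", "api", "db"}

# Only the database should page anyone; web and api are downstream victims.
print(root_cause_alerts(depends_on, failing))  # {'db'}
```

Everything is still monitored in this sketch; the dependency map only decides which of the failing checks is worth waking someone up for.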

That’s the overall look. As far as this review goes, it comes down to this: I would definitely read it again, but be aware of what the book is going to be about. It is NOT prescriptive at all, but it is short enough to be useful even for the smallest of operations departments.