Alarms lost in the noise

Adding an alert or alarm for a situation you need to monitor should be a good thing. Depending on how it is done, though, it can make things worse or, at a minimum, not work as expected.

In the book Upstream by Dan Heath there is a section about alarms (my emphasis) …

Have you ever rolled your eyes when you heard a fire alarm? That’s alarm fatigue, and it’s a critical problem. A group of researchers studied five ICUs (intensive care units), treating 461 patients, for a month in 2013. There were almost 400,000 audible alarms logged in a month, which broke down to 187 audible alarms per bed per day. When everything is cause for alarm, nothing is cause for alarm. As we design early-warning systems, we should keep these questions in mind:

  • Will the warning give us enough time to act effectively? (If not, why bother?)
  • What rate of false positives can we expect? Our comfort with that level of false positives may, in turn, hinge on the relative cost of handling false positives versus the possibility of missing a real problem.

I had two experiences of alarms during the last week.

We were using an office in a large building in Vienna. Suddenly the fire alarm went off. As we were only using the office for the day, we checked whether it was a test or whether we had to leave the building. It was real, so we all evacuated, along with hundreds of other people.

By the time we were outside, the first fire engine had turned up. Within five minutes there were at least another nine appliances present. Fortunately they were able to deal with the situation and we could return after just over an hour.

This is an example of an alarm that should always be investigated. Fire alarms are relatively rare and could involve a risk to life or significant damage to property. Yet our previous experience of fire alarms meant that there was an element of doubt about whether we needed to act on it.

The second experience was at work. We have recently added a new monitoring tool that alerts us if data is missing. This is something that should never happen, but if it does we would like to know about it as quickly as possible.

The tool was deployed to an environment and we noticed that it was generating an alert every few hours saying there was an issue and then, a short while later, effectively cancelling it by saying everything was okay.

When we investigated, we found that one part of the workflow was not reliably generating all the data required by the next step. Sometimes it would not provide everything needed and the tool would generate an alert, even though there was no real underlying problem with the system being monitored.

In this instance we were getting frequent alerts about something important, but they were wrong. That also meant we would not know when a “real” alert was generated, as it would be lost in the noise. A real alert would be missed, which made the tool effectively useless. In the short term we have disabled the monitoring tool while the issue is fixed.
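One general way to stop short-lived blips like these from drowning out real alerts is to raise an alert only once the condition has persisted for several consecutive checks. The sketch below is purely illustrative and is not taken from the tool described above: the class name `DebouncedAlert`, the `threshold` parameter and the `run` helper are all hypothetical, and it assumes the check is polled at a regular interval.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class DebouncedAlert:
    """Hypothetical helper: fire only after `threshold` consecutive failing checks."""

    threshold: int = 3
    consecutive_failures: int = 0
    firing: bool = False

    def observe(self, data_missing: bool) -> Optional[str]:
        """Record one check result and return "ALERT", "RESOLVED" or None."""
        if data_missing:
            self.consecutive_failures += 1
            # Only notify on the transition into the firing state.
            if not self.firing and self.consecutive_failures >= self.threshold:
                self.firing = True
                return "ALERT"
            return None

        # A healthy check resets the streak.
        self.consecutive_failures = 0
        if self.firing:
            self.firing = False
            return "RESOLVED"
        return None


def run(checks: Iterable[bool], threshold: int = 3) -> List[str]:
    """Feed a sequence of check results through the alert and collect notifications."""
    alert = DebouncedAlert(threshold=threshold)
    events = []
    for data_missing in checks:
        event = alert.observe(data_missing)
        if event is not None:
            events.append(event)
    return events


# A blip that clears by the next check never notifies anyone.
transient = [True, False, True, False, True, False]
# A problem that persists does notify, once, and then resolves once.
sustained = [True, True, True, True, False]

print(run(transient))
print(run(sustained))
```

The trade-off is that a real problem is reported `threshold` checks later than it otherwise would be, so the threshold has to be chosen with the first question from the book in mind: will the warning still give us enough time to act? It also only hides the symptom; the unreliable step in the workflow still needs fixing.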

Alarms are important and useful, but they need a high signal-to-noise ratio. Otherwise there is a risk they will be ignored and will not achieve their aim.

Links

Upstream: How to solve problems before they happen

As an Amazon Associate I earn from qualifying purchases.
