Monitoring isn’t just an operations task; the responsibility should be shared. Get development teams monitoring their own codebases so that they can react when things change unexpectedly.
People won’t act on alerts they don’t understand, so alerts need owners; otherwise they fall through the cracks and never get actioned. To someone without domain knowledge, alert channels in Slack or Teams can look like a series of graphs broken up by unintelligible strings of characters. To act, an employee needs to understand what the alert represents and needs to be responsible for the health of the system behind it.
Development teams are well placed to monitor their own codebases because they understand the moving parts of their service and can add telemetry that illuminates change. With this telemetry in place, the next time an operations alert is triggered alongside one from a service, it might be easier to figure out what went wrong.
Try the following things to get the most out of your alerts:
You have a problem when your time-series database (TSDB) stops receiving data, only you won’t know about it because your alerts are configured to fire on changes in metric behaviour, not on the absence of metric behaviour.
To solve the problem, set up an alert that triggers when there is no signal from your metrics.
You don’t want the alert to trigger every time a datapoint fails to reach the TSDB. Metrics occasionally miss the odd datapoint; perhaps a process fails on the machine responsible for sending metrics and a cycle is lost. It’s worth having a backfill mechanism in place to resend datapoints that the TSDB never received.
Instead, set up the alert to fire only after no data has been received for a period of time. This ‘wait time’ specifies how long the system should wait for the metrics to come back online before sending an alert; if they come back within the wait time, no alert is sent.
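As a rough sketch of that logic, assuming the last-received timestamp for a metric is available and picking an illustrative 15-minute wait time, the check might look something like the Python below. A real TSDB or alerting tool would normally express this as an alert rule rather than code, and the names here are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed wait time: how long to tolerate silence before firing a 'no data' alert.
WAIT_TIME = timedelta(minutes=15)


def no_data_alert_due(last_datapoint_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the metric has been silent for longer than WAIT_TIME."""
    now = now or datetime.now(timezone.utc)
    return now - last_datapoint_at > WAIT_TIME


# Example: the TSDB last saw a datapoint 20 minutes ago, so the alert should fire.
last_seen = datetime.now(timezone.utc) - timedelta(minutes=20)
if no_data_alert_due(last_seen):
    print("ALERT: no data received within the wait time")  # stand-in for a real notification
```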
Like broken vacuum cleaners in a student house, alert rules build up over time and need triaging. When people leave your company, they don’t take their alert rules with them. The rules just sit there, configured, until someone decides to remove them.
Some metrics change their patterns of behaviour, and alerts designed to catch changes in the old pattern are left on the shelf, never to fire again. It’s healthy to do some metaphorical spring cleaning regularly and get rid of out-of-date alert rules.
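One low-tech way to drive that spring clean is to periodically list rules that look stale, for example rules whose owner has left the company or that haven’t fired in a long time. The Python sketch below is illustrative only: the rule records and the 90-day cut-off are assumptions, and most alerting tools expose this metadata through their own APIs or config files.

```python
from datetime import datetime, timedelta, timezone

# Illustrative rule records with owner and last-fired metadata.
alert_rules = [
    {"name": "checkout-latency-high", "owner": "alice", "last_fired": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"name": "legacy-batch-job-failed", "owner": "bob", "last_fired": datetime(2022, 6, 1, tzinfo=timezone.utc)},
]

current_employees = {"alice"}          # in this example, bob has left the company
stale_cutoff = timedelta(days=90)      # assumed cut-off for "hasn't fired in a long time"
now = datetime.now(timezone.utc)

for rule in alert_rules:
    orphaned = rule["owner"] not in current_employees
    dormant = now - rule["last_fired"] > stale_cutoff
    if orphaned or dormant:
        print(f"Review candidate: {rule['name']} (orphaned={orphaned}, dormant={dormant})")
```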
Static thresholds are useful for notifying you when a metric falls below a particular value, but they won’t pick up cases where an unexpected change occurs within the metric’s normal range.
Metrics that exhibit seasonal changes don’t play well with static thresholds either. Thresholds need updating each time the metric adopts a new pattern of normal behaviour. Fail to update the threshold and you risk a limit that sits too far from the metric’s range to ever trigger, or a limit that runs so close to the normal range that it pings off false-positive notifications and requires constant tweaking.
Percentage-based thresholds, which trigger when a metric has deviated by a certain percentage, can respond better to seasonal changes. The trade-off is that it can take a long time for a metric to change enough to reach the threshold.
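As a minimal sketch of that idea in Python: the baseline here (say, the value at the same time yesterday) and the 25% threshold are illustrative assumptions, not recommendations.

```python
def deviates_by_percentage(current: float, baseline: float, threshold_pct: float) -> bool:
    """Return True if `current` deviates from `baseline` by more than `threshold_pct` percent."""
    if baseline == 0:
        # Avoid division by zero; treat any move off zero as a deviation.
        return current != 0
    deviation = abs(current - baseline) / abs(baseline) * 100
    return deviation > threshold_pct


# Example: a 25% threshold against yesterday's value at the same time of day.
print(deviates_by_percentage(current=130.0, baseline=100.0, threshold_pct=25.0))  # True
```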
Manually managing thresholds for thousands of metrics is a tedious and expensive operational overhead. Dynamic thresholds don’t need tweaking, since they update automatically to account for normal changes in metric behaviour.
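Implementations vary by vendor, but one simple sketch of a dynamic threshold derives its bounds from a rolling window of recent values, for instance the mean plus or minus a few standard deviations. The window contents and the multiplier `k` below are illustrative assumptions rather than a recommended configuration.

```python
from statistics import mean, stdev
from typing import Sequence


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> tuple[float, float]:
    """Compute lower/upper bounds from recent history: mean ± k standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Flag a datapoint that falls outside the bounds implied by recent behaviour."""
    lower, upper = dynamic_bounds(history, k)
    return value < lower or value > upper


# The bounds move with the metric, so no manual threshold tweaking is needed.
recent = [100, 102, 98, 101, 99, 103, 97, 100]
print(is_anomalous(150, recent))  # True: well outside recent behaviour
print(is_anomalous(104, recent))  # False: within the normal band
```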