Monitoring isn’t just an operations task; the responsibility should be shared. Get development teams monitoring their own codebases so that they can react when things change unexpectedly.
People won’t act on alerts they don’t understand, so alerts need owners; otherwise they fall through the cracks and never get actioned. To someone without domain knowledge, alert channels in Slack or Teams can look like a series of graphs broken up by unintelligible strings of characters. To act, an employee needs to understand what the alert represents and needs to be responsible for the health of the system behind it.
Development teams are well placed to monitor their own codebases because they understand the moving parts of their service and can add telemetry that illuminates change. With that telemetry in place, the next time an operations alert is triggered alongside one from a service, it should be easier to figure out what went wrong.
2. Help your future self
Try the following things to get the most out of your alerts:
Set alerts on key metrics, not across all your metric namespaces. You have limited capacity, in both time and energy, to respond to alerts, so make sure they track the metrics that matter. If you don’t know which metrics are important, you probably shouldn’t be setting up alerts yet.
Make the alert easy to identify in a list - add a clear name, and possibly a description detailing possible causes of the change. Steer clear of adding possible resolutions to the description. Use emojis for different applications or teams.
If you’re using Slack, configure the alert to @-mention the users who can take action; that way they get a notification when the alert fires. Resolution can then take place in the thread below the alert.
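As a minimal sketch of this idea in Python (the user ID here is hypothetical, and a real integration would post the text to Slack via an incoming webhook or the Slack API):

```python
# Hypothetical Slack user ID of the person who can act on this alert.
ONCALL_USER_ID = "U0123456789"

def build_alert_text(metric_name: str, detail: str,
                     user_id: str = ONCALL_USER_ID) -> str:
    """Build alert text that @-mentions the alert's owner.

    In Slack message markup, "<@U...>" renders as an @-mention and
    notifies that user when the message is posted, so the person who
    can take action sees the alert the moment it fires.
    """
    return f":rotating_light: *{metric_name}* <@{user_id}>\n{detail}"
```

For example, `build_alert_text("checkout_error_rate", "Error rate outside normal range")` produces a message that pings the owner directly rather than relying on someone happening to scan the channel.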
3. Decide what happens when there is no signal
You have a problem when your time-series database (TSDB) stops receiving data, only you won’t know about it, because your alerts are configured to fire on changes in metric behaviour, not on the absence of it.
To solve the problem, set up an alert that triggers when there is no signal from your metrics.
You don’t want the alert to trigger every time a datapoint fails to reach the TSDB. Metrics occasionally miss the odd datapoint; perhaps a process fails on the machine responsible for sending metrics and a cycle is lost. It’s worth having a backfill mechanism in place to resend datapoints that weren’t received by the TSDB.
Instead, set up the alert to fire only after no data has been received for a period of time. This ‘wait time’ specifies how long the system should wait for the metrics to come back online before sending an alert. If metrics come back within the wait time, no alert is sent.
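The wait-time check itself is simple; here is a minimal Python sketch, assuming a hypothetical ten-minute wait time and that you can look up when the last datapoint arrived:

```python
from datetime import datetime, timedelta

# Hypothetical wait time: how long to tolerate silence before alerting.
WAIT_TIME = timedelta(minutes=10)

def should_fire_no_data_alert(last_datapoint_at: datetime,
                              now: datetime,
                              wait_time: timedelta = WAIT_TIME) -> bool:
    """Fire only when the gap since the last datapoint exceeds the wait time.

    A single missed cycle inside the window is ignored, so the odd
    dropped datapoint doesn't page anyone.
    """
    return now - last_datapoint_at > wait_time
```

With a ten-minute wait time, a three-minute gap stays quiet while a fifteen-minute gap triggers the no-signal alert.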
4. Assess your alert rules every few months and cut out redundant rules
Like broken vacuum cleaners in a student house, alert rules build up over time and need triaging. When people leave your company they don’t take their alert rules with them; the rules just sit there, configured, until someone decides to remove them.
Some metrics change their patterns of behaviour, and alerts designed to identify changes in the old pattern are left on the shelf, never to fire again. It’s healthy to do some metaphorical spring cleaning regularly and get rid of out-of-date alert rules.
5. Replace static with dynamic thresholds in most cases
Static thresholds are useful for notifying you when a metric falls below a particular value but they won’t pick up cases where an unexpected change occurs within the normal range of the metric.
Metrics that exhibit seasonal changes don’t play well with static thresholds either: the threshold needs updating each time the metric adopts a new pattern of normal behaviour. Fail to update it and you risk a limit that sits too far from the metric’s range to ever trigger, or a limit that runs so close to the normal range that it pings off false positive notifications and requires constant tweaking.
Percentage-based thresholds, which trigger when a metric has deviated by a certain percentage, can respond better to seasonal changes. The trade-off is that it can take a long time for a metric to change enough to reach the threshold.
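A percentage-based check is easy to sketch in Python (the 20% figure and the idea of a baseline value, e.g. the same point in the previous seasonal cycle, are assumptions for illustration):

```python
def breaches_percentage_threshold(current: float, baseline: float,
                                  max_deviation_pct: float = 20.0) -> bool:
    """True when the metric has deviated from its baseline by more than
    max_deviation_pct percent, in either direction."""
    if baseline == 0:
        # Any movement away from a zero baseline counts as a breach.
        return current != 0
    deviation_pct = abs(current - baseline) / abs(baseline) * 100
    return deviation_pct > max_deviation_pct
```

Because the check is relative to the baseline rather than a fixed value, it tracks seasonal shifts, but a slow drift can take a long time to accumulate a 20% deviation.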
Manually managing thresholds for thousands of metrics is a tedious and expensive operational overhead. Dynamic thresholds don’t need tweaking: they update automatically to account for normal changes in metric behaviour.