Let's step through the terms you need to know.
A time-series is a sequence of measurements collected from one source at different points in time and ordered chronologically.
A metric is time-series data that is used to track the performance of a system. Measurements are taken at regular intervals and new data-points are added to the series in real time.
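As a toy illustration (the metric name and interval here are made up, not part of any particular tool), a metric can be held as an ordered list of timestamp/value pairs, with a new pair appended at each collection interval:

```python
import time

# A time-series: measurements from one source, ordered chronologically.
# Each entry is a (unix_timestamp, value) pair.
cpu_usage = [
    (1700000000, 41.2),
    (1700000060, 43.8),
    (1700000120, 39.5),
]

def append_datapoint(series, value, interval=60):
    """Append a new measurement, keeping the fixed collection interval."""
    last_ts = series[-1][0] if series else int(time.time())
    series.append((last_ts + interval, value))

append_datapoint(cpu_usage, 44.1)   # a new data-point arrives in real time
```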
Anomaly detection is the process of identifying outliers or unexpected patterns in time-series data.
Anomaly detection is often run against a static dataset long after it was collected. Post hoc analysis can explain why something happened after the fact, but it is little help for spotting anomalous events as they occur. Unexplained changes can hurt your business, so the sooner you know about them the better.
Real-time anomaly detection involves running anomaly detection analysis on metric data each time a new measurement is collected. At Anomify, machine learning algorithms analyse metrics to define a baseline of expected behaviour. New data-points that deviate from this baseline are classified as anomalies. The baseline is updated through a process of semi-supervised learning.
Semi-supervised machine learning is a method for classifying unlabelled data. It combines some human intervention to guide the algorithms (supervised learning) with some self-directed machine analysis (unsupervised learning).
Unsupervised learning is another machine learning method that requires no human intervention, but it is a poor fit for the anomaly detection problem space because it ignores real-world context and produces more false positives.
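Anomify's models are more sophisticated than this, but a minimal sketch of the baseline-and-deviation idea, using a rolling mean and standard deviation as a stand-in for a learned baseline, might look like:

```python
from statistics import mean, stdev

def is_anomalous(history, new_value, window=60, tolerance=3.0):
    """Flag a new data-point that deviates too far from recent behaviour.

    `history` is a list of recent metric values; the last `window` of them
    form the baseline. This is a simple stand-in for a learned model.
    """
    baseline = history[-window:]
    if len(baseline) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) > tolerance * sigma

# Each time a new measurement arrives, test it against the baseline,
# then fold it into the history so the baseline keeps adapting.
history = [42.0, 41.5, 43.1, 42.7, 44.0]
new_value = 97.3
if is_anomalous(history, new_value):
    print("anomaly detected")
history.append(new_value)
```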
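One simplified way to picture the human side of that loop (a sketch only, not Anomify's actual training procedure) is a detector whose sensitivity is nudged by the labels a person gives to flagged anomalies:

```python
def apply_feedback(tolerance, label, step=0.1):
    """Adjust the detector's sensitivity based on a human label.

    "false_positive": the flagged point was actually normal, so widen
    the tolerance (fewer alerts on similar deviations in future).
    "true_anomaly": the alert was correct, so tighten it slightly.
    """
    if label == "false_positive":
        return tolerance + step
    if label == "true_anomaly":
        return max(1.0, tolerance - step)
    return tolerance

tolerance = 3.0
tolerance = apply_feedback(tolerance, "false_positive")  # now 3.1
```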
Observability has achieved buzzword status in recent years, but behind the hype it is a mindset for monitoring software systems. The idea is to maximise the observable space within a software system so that you can ask any question about its current state.
For distributed IT systems, observability involves integrating logs, metrics, traces and profiling data into the moving parts of the software system. Human ingenuity and developer tooling are then required to ask meaningful questions of your system, perform root cause analysis and pick out the relationships between data points.
Anomaly detection can help monitoring professionals make sense of their observability data by bubbling up unexpected signals when they are debugging complex issues.
SRE (Site Reliability Engineering) is a monitoring paradigm pioneered at Google that has taken off in recent years. It involves first establishing a set of service-level metrics that reflect business value and then taking calculated risks to improve those metrics.
You have a budget for taking calculated risks. Rather than aim for 100% uptime, you aim for, say, 99.9%, which gives you 0.1% of allowed downtime. This permitted downtime is characterised as an error budget. While you have error budget remaining, you try to optimise your system to improve service-level metrics, accepting the risk of breaking things and burning down your error budget. Conversely, when the error budget is being consumed, i.e. the system is down, you are firefighting to restore it and stabilise the budget.
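A quick back-of-the-envelope calculation makes the budget concrete (the figures here are purely illustrative):

```python
# A 99.9% availability target over a 30-day month.
slo = 0.999
period_minutes = 30 * 24 * 60           # 43,200 minutes in the month

error_budget = (1 - slo) * period_minutes
print(error_budget)                     # ~43.2 minutes of allowed downtime

# 20 minutes of downtime so far this month burns roughly half the budget.
downtime_minutes = 20
budget_remaining = error_budget - downtime_minutes
print(budget_remaining / error_budget)  # ~0.537 of the budget left
```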
Anomaly detection can illuminate optimisation issues by flagging metrics that changed unexpectedly. It helps surface unknown issues to work on because it runs across your entire metric set, not just the metrics you have alerts configured for.
When you're firefighting, anomaly detection can provide a snapshot of change points in your system, giving you a bird's-eye view of abnormal behaviour. There is no need to rifle through dashboards to resolve complex issues when you're pressed for time and multiple architecture layers are involved.
Anomify adds an additional dimension to your monitoring/observability setup. Where engineers typically define rules and threshold alerts on specific metrics, Anomify monitors all your metrics in the background, keeping an eye on everything. There is no need for manual threshold tweaking.
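For contrast, a hand-maintained threshold rule typically looks something like this generic sketch (not the syntax of any specific alerting tool): it only ever fires on the handful of metrics someone thought to write a rule for.

```python
# Hand-maintained threshold rules: each covers exactly one metric, and
# every threshold has to be revisited whenever "normal" load changes.
threshold_rules = {
    "server1.cpu.percent": 90.0,
    "server1.disk.used_percent": 85.0,
}

def check_thresholds(latest_values, rules=threshold_rules):
    """Return the metrics that breached their static thresholds.

    Any metric without a rule (the vast majority) is never checked.
    """
    return [name for name, value in latest_values.items()
            if name in rules and value > rules[name]]

alerts = check_thresholds({"server1.cpu.percent": 93.2,
                           "server1.api.error_rate": 0.4})
```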
Rules, thresholds and SLO (Service Level Objective) calculations do not necessarily help pinpoint what changed and where – in fact they exclude most metrics. Anomify monitors all metrics and identifies and records abnormal changes, giving you deep insights into your systems and applications when you need them.
Most anomaly detection platforms use unsupervised learning, which creates a disconnect between the user and the model making the anomaly assessment. Anomify’s transparent supervision provides a human explanation for the predictions it makes. The analysis can be trained to fit with your mental model of how the system should behave under normal conditions.
Artificial intelligence for IT operations (AIOps) is the application of machine learning models to IT operations tasks. AIOps assists SRE and DevOps professionals by providing suggestions that remove toil and speed up the resolution of issues in any one of the following areas:
The field of AIOps is in its infancy and tooling companies are still working out how best to serve their users. It's important for the AI component to be seen as an aid, a helping hand in an otherwise human process. Humans are better than machines at drawing conclusions across different systems, but AI is adept at 'bubbling up' information or context to streamline human decision making.