Leaping up the reliability ladder - jumping from step 1 to 5 in one giant leap

In 2022 Steve McGhee and James Brookbank from Google published a roadmap for Reliability Engineering — https://r9y.dev/

The roadmap is a simple tech tree. You can implement it area by area to reach the appropriate number of “nines” for your organization.

The roadmap divides the maturity of a reliability engineering team into stages. Organizations with the tooling and processes described in the lowest tier should be aiming for 90.0% uptime. Organizations with the tooling and processes described in the top tier should be aiming for 99.999% uptime.

The roadmap places anomaly detection in the highest and last stage, aligned with the observability tier. That stage would be deployed by a “99.999 Well engineered business”, which accepts an unreliability of 5.26 minutes/year, as part of an autonomic system (a system that provides self-healing and self-protection capabilities).
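To see where the 5.26 minutes/year figure comes from, here is a minimal sketch of the arithmetic. It assumes an average year of 365.25 days; the function name is made up for illustration and is not part of the roadmap.

```python
# Minutes in an average year. Using 365.25 days (to account for leap years)
# is an assumption for this sketch; it gives 525,960 minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# The lowest and highest uptime targets mentioned in the roadmap.
for target in (90.0, 99.999):
    print(f"{target}% uptime -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

For the top tier this works out to about 5.26 minutes/year, while the 90.0% tier allows roughly 52,596 minutes, or about 36.5 days.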

That is a long road, but it does not have to be. Although anomaly detection will not get you three 9s on its own, it is a component that definitely helps you get there. One of the reasons it probably sits in the last stage is that the authors are implying it needs a lot of the other stages in place before you can implement it, such as telemetry collection and automated host provisioning and configuration.

Another reason it sits in the last stage is that, besides the automated infrastructure and telemetry, it also requires specialized and skilled staff who can implement and operate anomaly detection. It is unrealistic to expect smaller orgs, lower down the ladder, to have those types of employees at that point in their journey.

Luckily, to actually implement anomaly detection you really only need HALF of the first step on the observability tier: HOST METRICS. With host metrics alone you can start to do anomaly detection. It does help if the host metrics are sent by an automated process, but even if your infrastructure is not automated you can still configure your hosts manually to send their metrics somewhere.
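As an illustration of how little is needed to get started, here is a minimal sketch that flags anomalies in a single host metric using a rolling three-sigma rule. This is a generic textbook technique chosen for the example, not a description of how Anomify works, and the function name, window size, threshold and sample data are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=30, threshold=3.0):
    """Return indices of points that deviate from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        # Only judge a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat history (sigma == 0) is skipped in this sketch.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical CPU utilisation samples (percent) with one injected spike.
cpu_percent = [20.0 + (i % 5) for i in range(60)]
cpu_percent[45] = 95.0

print(detect_anomalies(cpu_percent))  # prints [45]
```

A real deployment would need to handle seasonality, noisy metrics and far more series than this, but the input is the same: nothing more than host metrics.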

Ironically, it was host metrics that enabled Anomify to climb the reliability engineering ladder. We built our anomaly detection specifically to give us visibility and monitoring on tens of thousands of metrics from 430 hosts and their applications, spread across 13 data centers globally and serving up to 6.4 million ad requests per minute with realtime bidding.

We needed to develop a cutting-edge internal anomaly detection platform to identify and understand changes in our globally distributed ad platform. With 4 different cloud providers and hundreds of partners and customers, significant changes could come from anywhere, in error or intentionally (friendly fire): a customer launching exceptional campaign traffic, someone publishing an incorrect tab, or the Hong Kong data center being network partitioned. Anomaly detection was the only technology able to keep tabs on it all. It was the only way to identify, pinpoint and understand vectors of change in a large, global and very dynamic platform, especially with a small ops team of two, and then one.

With Anomify you can jump directly from step 1 on the observability reliability engineering ladder to partially fulfilling step 5! Anomaly detection alone will not give you three 9s, but it will give you information about changes in your systems as if you were a three 9s org. Think of it as a virtual member of your SRE team. No SRE team? Well then, call it your own virtual SRE team member that keeps track of all the significant changes for you.

You do not need to be a mature, well engineered business to have and use anomaly detection. You just have to be on the road, at any stage of that journey.

Anomaly detection for everyone.

https://anomify.ai — stay on top of your metrics
