A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics, something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud.
By Adrian Hilton, Customer Reliability Engineer, SRE.
The concept of SRE starts with the idea that metrics should be closely tied to business objectives. Alongside business-level service-level agreements (SLAs), we use service-level objectives (SLOs) and service-level indicators (SLIs) in SRE planning and practice. The main parts of this article:
- Defining the terms of site reliability engineering
- Service-Level Objective (SLO)
- Service-Level Agreement (SLA)
- Service-Level Indicator (SLI)
SRE begins with the idea that availability is a prerequisite for success. An unavailable system can’t perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to its use as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.
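
To make the idea of a historical availability measurement concrete, here is a minimal Python sketch. The text above does not prescribe a formula, so this assumes one common choice: availability as the fraction of requests served successfully over a measurement window. The function name `availability` and the request counts are hypothetical, chosen only for illustration.

```python
def availability(successful: int, total: int) -> float:
    """Return the fraction of requests that succeeded in a measurement window.

    Assumption: availability is measured as successful / total requests.
    Other definitions (for example, time-based uptime) are equally valid.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of requests")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


# Hypothetical counts for one measurement window (illustrative only).
measured = availability(successful=999_500, total=1_000_000)
print(f"Measured availability: {measured:.4%}")
```

Read this way, the measured ratio doubles as a rough estimate of the probability that a future request will succeed, provided the system and its traffic keep behaving as they did during the window.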