Availability
Availability is a measure of a system's operational uptime and accessibility to users over a period. It is a key metric in reliability engineering, often expressed as a percentage (e.g., 99.999% or 'five nines'). It quantifies the probability that a system will be functioning and able to deliver its services at any given time. High availability is a primary goal in designing robust and dependable systems, especially for critical services.
Definitions
Core Concept in System Design
Availability is a fundamental quality attribute that measures the percentage of time a system or service is operational and accessible to perform its required function. It is a primary metric for assessing the resilience and dependability of a system.
It is mathematically expressed as:
Availability = Uptime / (Uptime + Downtime)
This is often communicated using the 'nines' system, which provides a clear understanding of the expected downtime for a service:
- 99% (Two nines): ~7.3 hours of downtime per month.
- 99.9% (Three nines): ~43.8 minutes of downtime per month.
- 99.99% (Four nines): ~4.38 minutes of downtime per month.
- 99.999% (Five nines): ~26.3 seconds of downtime per month.
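The monthly figures above follow directly from the formula. A small sketch (assuming an average 730-hour month, which is how these numbers are typically derived) shows the calculation:

```python
# Allowed downtime per month for common availability targets,
# derived from Availability = Uptime / (Uptime + Downtime).
# Assumes an average 730-hour month.

HOURS_PER_MONTH = 730

def downtime_per_month_seconds(availability: float) -> float:
    """Return the monthly downtime budget, in seconds."""
    return HOURS_PER_MONTH * 3600 * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    secs = downtime_per_month_seconds(target)
    print(f"{target:.3%} -> {secs / 60:.2f} minutes/month")
```

Running this reproduces the list above: 99% allows about 438 minutes (~7.3 hours), 99.9% about 43.8 minutes, and 99.999% only about 26.3 seconds per month.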
Achieving higher levels of availability involves significant trade-offs. While a five-nines system offers exceptional uptime, it requires greater complexity, redundancy, and cost compared to a three-nines system. Factors impacting availability include hardware failures, software bugs, network outages, deployment errors, and external dependencies.
Strategies for High Availability (HA)
High Availability (HA) refers to the architectural principles and practices aimed at ensuring a system achieves a desired level of operational readiness. The goal is to minimize or eliminate downtime by building resilience against failures. Key strategies include:
- Redundancy: This is the core principle of HA. It involves duplicating critical components of the system (e.g., servers, databases, power supplies, network links). If one component fails, a redundant one can take its place, ensuring continuous serviceability.
- Failover: This is the mechanism that automatically switches from a failing component to a redundant one. Failover can be active-passive (a standby component is activated upon failure) or active-active (all components are operational and share the load).
- Fault Detection: Systems must be able to detect failures promptly to initiate a failover. This is often achieved through health checks or 'heartbeat' mechanisms where components regularly signal their operational status.
- Load Balancing: Distributes incoming requests across a cluster of redundant servers. This not only improves performance and scalability but also enhances availability by routing traffic away from servers that have failed or are unresponsive.
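The interplay of redundancy, fault detection, and failover can be sketched in a few lines. This is an illustrative toy, not a production pattern: the `Backend` and `pick_backend` names are invented for this example, and the `healthy` flag stands in for what a real health-check or heartbeat mechanism would set.

```python
# Minimal active-passive failover sketch. In a real system the
# `healthy` flag would be updated by periodic health checks or
# heartbeats rather than set by hand.

from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    healthy: bool = True  # stand-in for health-check state

def pick_backend(backends: list[Backend]) -> Backend:
    """Return the first healthy backend; raise if all are down."""
    for backend in backends:
        if backend.healthy:
            return backend
    raise RuntimeError("no healthy backend available")

primary = Backend("primary")
standby = Backend("standby")

assert pick_backend([primary, standby]).name == "primary"
primary.healthy = False  # fault detection flags the failure
assert pick_backend([primary, standby]).name == "standby"  # failover
```

An active-active variant would instead round-robin across all healthy backends, which is essentially what a load balancer with health checks does.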
Origin & History
Etymology
Derived from the word 'available', meaning 'capable of being used or obtained'. In a technical context, it refers to the state of a system being ready and accessible for its intended function.
Historical Context
The concept of **availability** became a significant concern with the advent of mainframe computers in the 1960s and 1970s. Businesses relied on these systems for critical operations, making their **uptime** essential. A major milestone was the introduction of Tandem Computers' NonStop systems in 1976, which were specifically designed for fault tolerance and high **availability**. The rise of the internet in the 1990s transformed **availability** from a business-centric concern to a public expectation. E-commerce sites and online services needed to be accessible 24/7, making **operational readiness** a competitive advantage. This era saw the popularization of techniques like load balancing and server clustering. With the emergence of cloud computing in the 2000s, high **availability** became a commoditized feature. Cloud providers like AWS, Azure, and Google Cloud Platform built global infrastructure with built-in redundancy, such as Availability Zones and Regions, allowing developers to build highly available systems more easily and cost-effectively than ever before.
Usage Examples
Our Service Level Agreement (SLA) promises customers 99.99% availability, which dictates our entire operational strategy.
To increase the application's availability, the engineering team implemented a multi-region failover system.
The recent hardware failure severely impacted the database's uptime, leading to a breach of our guaranteed serviceability for the quarter.
Frequently Asked Questions
How is availability typically measured and expressed?
Availability is typically measured as a percentage of uptime over a total period of time. The formula is:
Availability = (Total Time - Downtime) / Total Time * 100%
It is commonly expressed using the 'nines' notation:
- 99% (Two Nines): Allows for about 3.65 days of downtime per year.
- 99.9% (Three Nines): Allows for about 8.76 hours of downtime per year.
- 99.99% (Four Nines): Allows for about 52.6 minutes of downtime per year.
- 99.999% (Five Nines): Allows for about 5.26 minutes of downtime per year.
This metric is a critical component of Service Level Agreements (SLAs).
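Rearranging the formula gives the downtime budget directly: Downtime = Total Time × (1 − Availability). A quick sketch (assuming a 365-day year, consistent with the figures above):

```python
# Yearly downtime budget implied by an availability target, from
# Downtime = Total Time * (1 - Availability). Assumes a 365-day year.

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_per_year_hours(availability: float) -> float:
    """Return the yearly downtime budget, in hours."""
    return HOURS_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%}: {downtime_per_year_hours(target):.3f} hours/year")
```

This yields about 87.6 hours (3.65 days) at two nines, 8.76 hours at three nines, and roughly 52.6 minutes at four nines, matching the list above.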
What is the difference between availability and reliability?
While related, they measure different aspects of a system's performance.
Availability is the probability that a system is operational and accessible at a specific point in time. It answers the question, 'Is the system up right now?' A system can have high availability even if it fails frequently, as long as it recovers very quickly.
Reliability is the probability that a system will perform its intended function without failure for a specified period. It answers the question, 'How long can the system run without failing?' It is often measured by Mean Time Between Failures (MTBF).
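The distinction can be made concrete with the standard steady-state identity Availability = MTBF / (MTBF + MTTR), where MTTR (Mean Time To Repair) is how long recovery takes. The formula itself is a common reliability-engineering result assumed here, not stated in the text above:

```python
# Steady-state availability from MTBF (mean time between failures)
# and MTTR (mean time to repair). Illustrates that a system failing
# often but recovering fast can be *more* available than one that
# fails rarely but recovers slowly.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Fails every ~100 h but recovers in ~36 s: ~99.99% available.
frequent_but_quick = availability(mtbf_hours=100, mttr_hours=0.01)

# Fails every ~10,000 h but takes 10 h to recover: ~99.9% available.
rare_but_slow = availability(mtbf_hours=10_000, mttr_hours=10)
```

Here the less reliable system (lower MTBF) ends up with the higher availability, which is exactly the point made above: fast recovery can compensate for frequent failure.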