Availability

A good availability metric should be:

Meaningful
Proportional
Actionable

Meaningful captures the users’ experience. Proportional captures a change in the metric that should be proportional to the user-perceived change. Actionable captures insights into why the availability for a period was so low.

Quantify Availability

The two most common approaches to quantify availability are Success ratio and Incident ratio.

Success ratio (%)

The fraction of the number of successful requests to total requests over a period of time

The period of time is not important.

Incident ratio (%)

The ratio of up minutes to down minutes

“up minutes” are based on the duration of known incidents

Availability Metrics

Time-based

The time between failures is uptime and the time to recover from failure is downtime.

availability = uptime / (uptime + downtime)

Requirements:

Manual labeling of uptime and downtime
Use of a given threshold

Count-based

It uses success ratio, which is the number of successful requests to total requests. It perceives what the user perceives over time.

Characteristics:

Easy to implement.
Prone to bias: by a highly active user

Synthetic Probes

Synthetic monitoring means that a script is used to create an activity that is monitored, the activity can be anything from a simple ping to determine if a server is up, to an emulated user transaction which uses a real browser. Probes the system automatically at regular time intervals.

It mitigates some of the problems of success ratio when emulating user transactions.

User-Uptime

There are two ways to achieve:

Synthetic probes for each user
User requests as probes

Both formulas can be used in the Incident ratio and Success ratio.

References

Paper:

Tamas Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean and Amer Diwan Google Meaningful Availability

Video: