Reliability

Definition

Reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. The Reliability of a system is the probability that the system will operate correctly in a specified operating environment until time (assuming it was operating at time ).

If a system needs to work for slots of 10 hours at the time, then 10 hours is the reliability target. Reliability is often used to characterize systems in which even small periods of incorrect behavior are unacceptable, like in the impossibility to repair.

is a non-increasing function varying from 1 to 0 over , where

The probability density function of the failure is .

We define also the unreliability of the system as .

The exploitation of the information is used to compute, for a complex system, its reliability in time, that is the expected lifetime, and the computation of the Mean Time to Failure (MTTF). The computation of the overall reliability starting from the components’ one is also possible.

Info

We consider only a reliability with an exponential distribution, where the failure rate of the system is constant over time.

We will use several terms to describe the reliability of a system:

Definitions

  • Fault: is a condition that causes a system to fail in performing its required function.
  • Error: is the difference between a computed, observed, or measured value and the true, specified, or theoretically correct value.
  • Failure: is the event of a system or component not performing a required function according to its specifications.

Usually there is a consequence that leads to a failure, starting from a fault that causes an error that leads to a failure. However, it’s not always the case, as a fault can be non-activated, or an error can be non-propagated.

Availability

Definition

Availability is the degree to which a system or component is operational and accessible when required for use. The availability is defined as the ratio of the uptime to the sum of the uptime and the downtime; it’s also the ratio of the Mean Time To Failure (MTTF) to the sum of the MTTF and the Mean Time To Repair (MTTR).

The availability of a system is the probability that the system will be operational at time .

It’s fundamentally different from the reliability, as the reliability is the probability that the system will not break down, while the availability is the probability that the system will be working even if it breaks down. The availability is literally the readiness for service and admits the possibility of brief outages. When the system is not repairable the availability is equal to the reliability , while in general for repairable systems we have that .

We can define the unavailability of the system/component as .

Numbers

The availability can be expressed as a function of the “number of 9’s” and it is a fundamental number to understand the availability of a system.

Number of 9’sAvailabilityDowntime (mins/year)Downtime/yearSystem
190%52596.00~5 weeks
299%5259.60~4 daysGeneric website
399.9%525.96~9 hoursAmazon.com
499.99%52.60~1 hourEnterprise server
599.999%5.26~5 minutesTelephone system
699.9999%0.53~30 secondsNetwork switches
799.99999%0.05~3 seconds

MTTF and MTBF

Reliability and availability are related indices because, if a system is unavailable, it is not delivering the specified system services and, therefore, it’s not making money or providing the expected service. It is possible to have systems with low reliability that must be available, as system failures can be repaired quickly and do not damage data, low reliability may not be a problem. The opposite (high reliability and low availability) is generally more difficult.

Definitions

The Mean Time To Failure (MTTF) is the mean time before any failure will occur. It’s defined as the expected value of the time to failure.

The Mean Time Between Failures (MTBF) is the mean time between two failures. It’s defined as the expected value of the time between two failures.

is another way of reporting the MTBF, as the number of expected failures per one billion hours () of operation for a device.

Other indices

Any system’s lifetime can be divided into three phases:

  • Infant Mortality: failures showing up in new systems. Usually, this period of time is present during the testing phases, and not during production phases.
  • Random Failures: showing up randomly during the entire life of a system. This is the main focus of the reliability.
  • Wear Out: at the end of its life, some components can cause the failure of a system. Predictive maintenance can reduce the number of this type of failures.

To measure the reliability of a system and compute the MTTF, we can use the burn-in test, a test that stresses the system with excessive temperature, voltage, current, and humidity to accelerate wear out. This process is used to identify defective products, but it’s not always reliable because it can stress the system too much and it may require a lot of time.

An empirical evaluation of the reliability can be done by considering independent and statistically identical elements deployed at time in identical conditions. At time , elements are not failed, and are the times to failure of the elements. Times to failure are independent occurrences of the random quantity .

NOTE

The function is the empirical function of reliability that as converges to the value .

There are many different types of faults that can occur in a system, that can be divided into two main categories: hardware faults (electronic and mechanical) and software faults.

An hardware electronic fault have a balanced distribution of the failures over time, with an infant mortality phase, a useful life, and a wear-out phase. The probability density function of the failure is a bell-shaped curve, where the infant mortality phase and the wear-out phase are balanced, while the useful life is the longest and the phase with the lowest number of failures.

A mechanical fault has a similar distribution, but the infant mortality phase is very short and the wear-out phase is longer. The probability density function of the failure in the infant mortality phase is very low, and increases in the useful life phase, while it explodes in the wear-out phase.

A software fault has a different distribution, with an infant mortality phase and a useful life phase, but no wear-out phase, because the software doesn’t wear out; however, the update process can introduce new faults and the shape of the continue to follow the same distribution as a cycle.