Dependability

Dependability is a measure of how much we trust a system. It is the ability of a system to perform its functionality while exhibiting:

  • Reliability: the continuity of correct service
  • Availability: the readiness for correct service
  • Safety: the absence of catastrophic consequences
  • Security: the confidentiality and integrity of data

Why?

During the design of a system, a lot of effort is devoted to making sure that the implementation matches the specifications, fulfills the requirements, meets the constraints, and optimizes selected parameters (performance, energy, …). Nevertheless, even if all of the above are satisfied, a system may still fail because something breaks. Many things can cause a system to fail, such as:

  • Defects, process variation, degraded transistors
  • Radiation, noise and interference, temperature, humidity
  • Design errors, software bugs, OS bugs
  • Malicious attacks, human errors, operator mistakes

Given all of the above, the system should still deliver an “acceptable” service, where acceptable covers the performance, integrity, availability, and security of the system.

A failure may cause heavy losses, whether economic or in the form of physical damage. Even a single system failure may affect a large number of people. Systems that are not dependable are unlikely to be used or adopted. Undependable systems may also cause information loss, with a high recovery cost as a consequence.

When?

Dependability is an important aspect that a designer should always consider, both at design-time and at runtime. At design-time, the system should be analyzed, dependability properties should be measured, and the design should be modified if required to meet dependability requirements; at runtime, malfunctions should be detected, causes should be understood, and the system should react.

Failures may occur in the development phase, where they should be avoided, and in operation, where they cannot be avoided and the system must deal with them. The design should take failures into account and guarantee that control and safety are preserved when failures occur. The effects of such failures should be predictable and deterministic, not catastrophic.

Where?

In the past, dependability was a relevant aspect only for safety-critical and mission-critical application environments, such as space, nuclear, and avionics. These environments have to deal with huge costs.

In a mission-critical system, a failure during operation can have serious or irreversible effects on the mission the system is carrying out. Examples include satellites, surveillance drones, unmanned vehicles, and automatic weather stations in harsh environments. In a safety-critical system, a failure during operation can present a direct threat to human life. Examples include aircraft control systems, medical instrumentation, railway signaling, and nuclear reactor control systems.

Computing infrastructures, such as data centers, are also a scenario where dependability is important. Downtime is the enemy of every data center, and a failure can have serious consequences, like financial losses, loss of reputation, and loss of customers. Other scenarios where dependability is important include automotive, telecommunication, smart spaces, and eHealth: all of them involve a large number of sensors, actuators, and computing systems that have to work together to provide the expected service. In these kinds of scenarios, many things may fail:

  • The nodes: computing systems, sensors, and actuators
  • The communication: network
  • The cloud: data storage and manipulation

Everything has to work properly to provide the expected service.

How?

Definitions

We define failure avoidance as the set of techniques that aim to avoid failures by employing conservative design, design validation, detailed testing, infant mortality screening, and error avoidance. These techniques are used to prevent the occurrence of failures and to ensure that the system will not fail during its operation.

On the other hand, failure tolerance is the set of techniques that aim to detect and mask errors during system operation through on-line monitoring, diagnostics, self-recovery, and self-repair. These techniques are used to ensure that the system will continue to operate correctly even if a failure occurs.
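
As a rough illustration of what "detect and mask errors" can mean in practice, the sketch below runs the same computation on three replicas and takes a majority vote on their outputs. This scheme (often called triple modular redundancy) is not named in the text above; it is chosen here only as one common example of masking. The functions healthy and faulty are hypothetical stand-ins for real replicas, and faulty deliberately returns a wrong value.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises RuntimeError when no strict majority exists: the error is
    detected, but it cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: error detected but cannot be masked")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x):
    # Hypothetical correct replica.
    return x * x


def faulty(x):
    # Hypothetical replica affected by a fault: its output is off by one.
    return x * x + 1


if __name__ == "__main__":
    # One faulty replica out of three is outvoted by the two healthy ones.
    print(run_replicated([healthy, faulty, healthy], 4))  # prints 16
```

The single wrong output is masked because two replicas still agree. If all three replicas disagreed, the voter could only report the problem, which is detection without masking.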

We can provide dependability by working at different levels of the design process:

  • Technological level: design and manufacture by employing reliable/robust components, which provides the highest dependability but at a high cost and with poor performance.
  • Architectural level: integrate normal components using solutions that allow managing the occurrence of failures, which provides high dependability but at a high cost and reduced performance.
  • Software/application level: develop solutions in the algorithms or in the operating systems that mask and recover from the occurrence of failures, which provides high dependability but at a high cost and reduced performance.

All solutions have cost and reduced performance in common: you have to pay for dependability.
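
To make the software/application level a little more concrete, here is a minimal sketch of detection plus recovery: each payload carries a CRC-32 checksum, a corrupted read is detected by recomputing the checksum, and recovery consists of simply reading again. The names make_flaky_reader and read_with_recovery are hypothetical, the flaky reader only simulates a fault by flipping one bit, and the retry limit of 3 is an arbitrary assumption.

```python
import zlib


def make_flaky_reader(payload, corrupt_first_n):
    """Build a simulated reader whose first `corrupt_first_n` reads are corrupted.

    `payload` must be non-empty. The checksum is computed once on the good data.
    """
    checksum = zlib.crc32(payload)
    state = {"calls": 0}

    def read_once():
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            # Simulate a fault: flip the lowest bit of the first byte.
            corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
            return corrupted, checksum
        return payload, checksum

    return read_once


def read_with_recovery(read_once, max_attempts=3):
    """Read until the checksum matches; return (payload, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        payload, checksum = read_once()
        if zlib.crc32(payload) == checksum:  # detection step
            return payload, attempt
        # Recovery step: fall through and read again.
    raise RuntimeError("data still corrupted after %d attempts" % max_attempts)


if __name__ == "__main__":
    reader = make_flaky_reader(b"sensor=42", corrupt_first_n=1)
    print(read_with_recovery(reader))  # prints (b'sensor=42', 2)
```

The second attempt is the price of recovery here: the correct value is obtained, but only after extra time and an extra read, which is one small instance of paying for dependability.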