Fault Tolerance¶

Prime #: 155
Origin domain: Computer Science & Software Engineering
Also from: Systems Thinking & Cybernetics, Engineering & Design
Aliases: Fault Resilience, Resilience Engineering
Related primes: Redundancy, Circuit Breaker, Distributed Systems, Robustness, Failure Modes

Core Idea¶

Fault tolerance is the ability of a system to continue functioning despite partial failures, achieved through redundancy, fallback mechanisms, or self-correction.

How would you explain it like I'm…

Still Works When Broken

Imagine your bike has two brakes, one on each wheel. If one breaks while you're riding, the other still stops you safely. The bike was built knowing that things sometimes break, so one broken part doesn't ruin the whole ride. That's the idea: build stuff so a small problem doesn't turn into a big disaster.

Keeps Working If a Part Fails

Airplanes have more than one engine. If one quits in the middle of a flight, the others keep the plane flying so it can land safely. That's fault tolerance: designing a system on the honest assumption that some part is going to fail eventually, then adding backups, alarms, and ways to slow down gracefully so the whole thing keeps working. The goal isn't perfect parts; it's a whole system that survives imperfect parts.

Designed-In Failure Resilience

Fault tolerance is the design discipline of building systems that keep doing their job even when pieces of them break. Hardware burns out, software has bugs, networks drop, people make mistakes, attackers attack: a serious designer plans for all of that instead of pretending it won't happen. Common moves are redundancy (more than one of the critical part), monitoring (notice failures fast), failover (switch to a backup automatically), error correction (detect and patch corrupted data), and graceful degradation (lose features, not service). The rule of thumb is: no single point of failure. One thing dying must never take the whole system down.
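The moves listed above can be sketched in a few lines. This is a minimal illustration, not a production pattern: the names `call_with_failover`, `get_recommendations`, `broken`, and `healthy` are hypothetical, and the "replicas" are plain Python callables standing in for redundant services.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Redundancy + failover: try each replica in order, return the first success."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # Monitoring (simplified): record the failure, then move to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def get_recommendations(replicas: Sequence[Callable[[], str]],
                        fallback: str = "popular items") -> str:
    """Graceful degradation: serve a simpler answer rather than no answer."""
    try:
        return call_with_failover(replicas)
    except AllReplicasFailed:
        return fallback


def broken() -> str:
    # Hypothetical replica that is currently down.
    raise ConnectionError("replica is down")


def healthy() -> str:
    # Hypothetical replica that is working.
    return "personalized items"


print(get_recommendations([broken, healthy]))  # personalized items
print(get_recommendations([broken, broken]))   # popular items
```

With one healthy replica the caller never notices the failure; with none, the feature is reduced but the service still answers.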

Fault tolerance is the property of a system that continues to deliver acceptable service in the presence of component failures or adverse conditions. It is achieved by explicitly incorporating redundancy (duplicate components so one failure does not stop service), monitoring (detecting anomalies), failover (automatic switchover to a healthy component), error detection and correction (parity bits that reveal corrupted bits and error-correcting codes, ECC, that recover them), and graceful degradation (reducing features rather than collapsing entirely). The defining design assumption is that components will fail and conditions will be imperfect: hardware ages, software has bugs, networks partition, humans err, adversaries attack. A fault-tolerant design is therefore evaluated not on the rate of component failure but on whether individual failures propagate into system-level failure. The standard structural target is 'no single point of failure': within the specified set of failure modes the design is built to survive, there must be no element whose loss alone breaks the system.
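To make the error-correction mechanism concrete, here is a small sketch of a Hamming(7,4) code, chosen here as an illustration and not named in the text above. A single parity bit can only reveal that a bit flipped; Hamming(7,4) adds three parity bits to four data bits so that any single flipped bit can also be located and repaired. The function names are hypothetical.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 means no error).
    error_pos = s1 + 2 * s2 + 4 * s4
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1  # simulate one corrupted bit in transit
print(hamming74_decode(received) == word)  # True
```

This sketch only handles a single flipped bit per 7-bit block; two simultaneous flips would be miscorrected, which is one example of the "specified set of failure modes" a design commits to surviving.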

Broad Use¶

Computing: Distributed systems replicate data to maintain service availability.
Transportation: Airplanes have backup flight control systems in case of main system failure.
Biology: The human immune system has multiple overlapping defense mechanisms to ensure continued function if one fails.
Engineering: Buildings in earthquake-prone zones use shock-absorbing designs to prevent collapse.
Organizational Management: Leadership succession plans ensure that key roles are filled even if someone leaves unexpectedly.
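The computing case above (replicating data to keep a service available) can be sketched as a majority-quorum write. This is a simplified illustration under stated assumptions: `Replica` and `quorum_write` are hypothetical names, replicas are in-memory objects, and real systems also need versioning, retries, and repair of partially applied writes.

```python
from typing import Dict, List


class Replica:
    """Hypothetical in-memory replica that can be marked as down."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def quorum_write(replicas: List[Replica], key: str, value: str) -> bool:
    """Return True if a strict majority of replicas acknowledged the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except ConnectionError:
            # A failed replica costs one acknowledgement, not the whole write.
            continue
    return acks >= len(replicas) // 2 + 1


cluster = [Replica("a"), Replica("b"), Replica("c", up=False)]
print(quorum_write(cluster, "user:1", "Ada"))  # True: 2 of 3 acknowledged

cluster[1].up = False
print(quorum_write(cluster, "user:2", "Bob"))  # False: only 1 of 3 acknowledged
```

With three replicas the write survives any single replica failure, but not two; the quorum size states exactly which failures the design tolerates.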

Clarity¶

Separates systems that fail completely from systems that degrade gracefully, preserving some level of continued function instead of collapsing catastrophically.

Manages Complexity¶

Encourages designing with failure in mind—identifying where redundancy, adaptability, and failover mechanisms are necessary.
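One way to identify where redundancy is needed is to list, for each role the system depends on, the components that can fill it, then check which single component failures break the system. The sketch below assumes that simple model; the role and component names are invented for illustration.

```python
from typing import Dict, List, Set


def system_works(roles: Dict[str, Set[str]], failed: Set[str]) -> bool:
    """The system works if every role still has at least one live provider."""
    return all(providers - failed for providers in roles.values())


def single_points_of_failure(roles: Dict[str, Set[str]]) -> List[str]:
    """Return every component whose failure alone stops the system."""
    components: Set[str] = set().union(*roles.values())
    return sorted(c for c in components if not system_works(roles, {c}))


# Hypothetical deployment: two web servers, two databases, one load balancer.
deployment = {
    "load_balancer": {"lb-1"},
    "web": {"web-1", "web-2"},
    "database": {"db-1", "db-2"},
}
print(single_points_of_failure(deployment))  # ['lb-1']
```

The check only covers one failure at a time and assumes failures are independent; correlated failures (a shared power supply, a shared bug) would need to be modeled as components in their own right.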

Abstract Reasoning¶

Introduces the principle of resilience: systems should not assume perfect conditions but should anticipate disruptions and adjust accordingly.

Knowledge Transfer¶

The idea of graceful degradation and redundancy planning appears in any field that needs resilience, from disaster response to economic policy.

Example¶

A coral reef ecosystem recovers from bleaching events because it can regrow and adapt over time.

Relationships to Other Abstractions¶


Parents (2) — more general patterns this builds on

Fault Tolerance is a kind of Robustness Prime

Fault tolerance is a specialization of robustness focused on continued operation specifically under component failures rather than across all perturbations.

Fault Tolerance presupposes Reserve Prime

Fault tolerance presupposes reserve because continued operation under failures draws on redundancy and capacity held beyond expected need.

Children (2) — more specific cases that build on this

Fail-Safe Prime is a kind of Fault Tolerance

Fail-safe is a specialization of fault tolerance in which continued service is sacrificed and the post-failure default state is engineered to be the least harmful.

Escape and Leakage Prime presupposes Fault Tolerance

Escape and leakage presupposes fault tolerance because its underlying "Swiss-cheese" geometry treats leakage as latent failure paths penetrating layered defenses.

Hierarchy paths (3) — routes to 3 parentless roots

Fault Tolerance → Robustness


Not to Be Confused With¶

Fault Tolerance is not Redundancy because fault tolerance is the property that the system maintains function despite component failures, while redundancy is the structural mechanism (multiple pathways) enabling fault tolerance; redundancy is the how, fault tolerance is the what.
Fault Tolerance is not Robustness because fault tolerance specifically addresses continued operation after discrete failures (components break), while robustness addresses resistance to perturbations in general, including continuous disturbances such as degraded operating conditions and parameter variation; the faults that fault tolerance handles are binary events (a component works or has failed), whereas the disturbances that robustness handles vary by degree.
Fault Tolerance is not Fail-Safe because fault tolerance seeks to maintain or restore function despite failures, while fail-safe seeks to revert to a safe state regardless of loss of function; the goals diverge: one prioritizes continuation, the other prioritizes safety.
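The last distinction can be illustrated with a toy controller. In this sketch, which assumes a hypothetical valve and hypothetical sensor functions, redundancy keeps the valve regulating while any sensor still answers (fault tolerance), and the valve closes once no reading is available (fail-safe). Treating "closed" as the least harmful state is an assumption of the example.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class ValveState(Enum):
    REGULATING = "regulating"
    CLOSED = "closed"  # assumed to be the least harmful state


def read_pressure(sensors: Sequence[Callable[[], float]]) -> Optional[float]:
    """Fault tolerance: return the first available reading from redundant sensors."""
    for sensor in sensors:
        try:
            return sensor()
        except OSError:
            continue  # this sensor failed; try the next one
    return None  # every sensor failed


def control_step(sensors: Sequence[Callable[[], float]]) -> ValveState:
    """Keep regulating while a reading exists; otherwise revert to the safe state."""
    reading = read_pressure(sensors)
    if reading is None:
        # Fail-safe: give up the function and choose safety.
        return ValveState.CLOSED
    return ValveState.REGULATING


def failed_sensor() -> float:
    raise OSError("sensor not responding")


def backup_sensor() -> float:
    return 101.3


print(control_step([failed_sensor, backup_sensor]))  # ValveState.REGULATING
print(control_step([failed_sensor, failed_sensor]))  # ValveState.CLOSED
```

The first call continues service despite a failure; the second stops service on purpose. Many real systems combine both, tolerating the failures they can and failing safe on the rest.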