
Failover

Status: draft
Scope: cross_prime
Structural signature: A dependency structure with a vulnerable primary path and an available alternate path capable of assuming the protected function.
Failure modes: false_failover, failover_loop, split_brain, stale_backup_state, shared_failure_domain, untested_standby, partial_switchover, lost_in_flight_work, failback_corruption, backup_capacity_exhaustion, ambiguous_ownership, cascading_failover
Domain examples: software_and_distributed_systems, databases_and_stateful_services, networking_and_routing, power_and_infrastructure, disaster_recovery_and_business_continuity, organizational_succession, supply_chain_management, governance_and_command_authority

Intent

Failover preserves continuity of function, flow, authority, or service by switching from a failed, degraded, or unavailable primary path to an alternate path or capacity.

The archetype is useful when a system has an essential function that depends on a primary component, channel, provider, process, authority, or route. If that primary path fails, the system does not merely stop, shed load, or reduce quality. It transfers the protected function to a prepared alternate.

In compact form:

When a protected function depends on a vulnerable primary path, prepare and activate an alternate path so continuity is preserved when the primary fails.

Primes

Composed of: Redundancy, Observability, Health Check, Switching Rule, Alternate Path, State Synchronization, Recovery Policy, Controlled Reentry

Related primes: Redundancy, Fault Tolerance, Resilience, Robustness, Flow, Coupling, Observability, State and State Transition, Transaction, Data Integrity, Resource Management, Threshold, Boundary

Structural Signature

This archetype is a strong candidate when the following conditions co-occur:

  • A system has an essential function, flow, authority, process, service, or responsibility.
  • That function normally depends on a primary path, component, provider, channel, site, person, system, or authority.
  • Failure or degradation of the primary would interrupt the protected function.
  • An alternate path or capacity can plausibly assume the function.
  • The system can detect primary failure, degradation, or unavailability.
  • The system can transfer flow, authority, state, context, or responsibility without violating critical invariants.
  • There is a recovery, reconciliation, or failback policy once the primary returns.

Failover is especially relevant when continuity matters more than preserving the exact original path.

Intervention Signature

Detect primary failure or degradation, transfer function or flow to an alternate path, and preserve enough state, authority, or continuity for the alternate to operate safely.

The intervention changes the system from a single vulnerable dependency into a prepared continuity structure:

primary path
  -> failure detected
      -> alternate path activated
          -> protected function continues
              -> recovery or failback managed

The key move is not merely having a backup. It is the operational transfer of function to that backup when the primary can no longer safely serve.
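
The flow above can be read as a small state machine. The following is a minimal sketch under assumed names: `Phase`, `TRANSITIONS`, and `advance` are hypothetical and not part of the entry, and the extra edge from FAILURE_DETECTED back to ON_PRIMARY is an assumption added to represent a false alarm.

```python
from enum import Enum, auto


class Phase(Enum):
    ON_PRIMARY = auto()        # primary path is serving
    FAILURE_DETECTED = auto()  # primary judged failed or degraded
    ON_ALTERNATE = auto()      # alternate path activated and serving
    RECOVERING = auto()        # failback or reconciliation in progress


# Allowed moves between phases; anything else is rejected.
TRANSITIONS = {
    Phase.ON_PRIMARY: {Phase.FAILURE_DETECTED},
    # Returning to ON_PRIMARY here models a false alarm (assumption).
    Phase.FAILURE_DETECTED: {Phase.ON_ALTERNATE, Phase.ON_PRIMARY},
    Phase.ON_ALTERNATE: {Phase.RECOVERING},
    # Recovery may end on either path: fail back, or stay on the alternate.
    Phase.RECOVERING: {Phase.ON_PRIMARY, Phase.ON_ALTERNATE},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Making the legal transitions explicit means a direct jump from ON_PRIMARY to ON_ALTERNATE, with no recorded detection step, is rejected rather than silently allowed.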

Causal Logic

A system with only one viable path is vulnerable to interruption. If the primary path fails, every dependent function stops unless the system can reduce demand, degrade service, repair the primary immediately, or switch to another path.

Failover works by changing dependency structure.

  1. Alternate capacity is prepared. A secondary component, route, provider, person, site, process, or authority exists before or during the failure.
  2. Primary failure becomes legible. The system detects unavailability, degraded performance, corruption, loss of authority, or unacceptable risk.
  3. Responsibility transfers. Flow, function, authority, or service moves from the primary to the alternate.
  4. State and context are preserved or reconciled. The alternate must know enough to operate safely.
  5. Continuity is maintained. The protected function continues with bounded disruption.
  6. Recovery is managed. The system either remains on the alternate, returns to the primary, or reconciles both paths.

The archetype transforms a single-path interruption into a managed switchover.
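
The six steps can be sketched as a toy controller. This is an illustration under stated assumptions, not a reference implementation: `Path`, `FailoverController`, `sync`, `tick`, and `fail_back` are hypothetical names, state is a plain dictionary, and copying the dictionary stands in for whatever replication a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Path:
    name: str
    is_healthy: Callable[[], bool]            # hypothetical health probe
    state: dict = field(default_factory=dict)


@dataclass
class FailoverController:
    primary: Path
    alternate: Path                           # step 1: alternate is prepared in advance
    active: Path = field(init=False)
    log: list = field(default_factory=list)

    def __post_init__(self) -> None:
        self.active = self.primary

    def sync(self) -> None:
        """Copy primary state to the alternate (stand-in for replication)."""
        self.alternate.state = dict(self.primary.state)

    def tick(self) -> Path:
        """Run one check cycle and return the path that should serve."""
        if self.active is self.primary and not self.primary.is_healthy():
            self.log.append("primary failure detected")        # step 2
            if not self.alternate.is_healthy():
                raise RuntimeError("no viable alternate")
            self.active = self.alternate                        # step 3
            # Steps 4 and 5: the alternate serves from its last synced state.
            self.log.append(f"active={self.alternate.name}")
        return self.active

    def fail_back(self) -> None:
        """Step 6: reconcile state and return to a healthy primary."""
        if self.active is self.alternate and self.primary.is_healthy():
            self.primary.state = dict(self.alternate.state)
            self.active = self.primary
            self.log.append(f"active={self.primary.name}")
```

In this sketch the alternate only knows what the last `sync` call gave it, so anything written to the primary after that point is missing after the switch. That gap is exactly what step 4 asks a real design to bound or reconcile.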

What It Is Not

Failover is not redundancy. Redundancy provides duplicate components or capacity. Failover is the intervention that activates or transfers function to alternate capacity when the primary fails or degrades.

Failover is not load balancing. Load balancing distributes ordinary load across multiple active resources. Failover transfers function because the primary path is unavailable, unsafe, or degraded. Some mechanisms combine both, but the intervention logic differs.
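
The difference in intervention logic can be shown with two small selection functions. Both are hypothetical sketches: `load_balance` uses simple round-robin as one possible balancing policy, and `fail_over` assumes the caller already knows which backends are healthy.

```python
from itertools import cycle
from typing import Iterator, Sequence


def load_balance(backends: Sequence[str]) -> Iterator[str]:
    """Ordinary load is spread across all backends in rotation."""
    return cycle(backends)


def fail_over(priority: Sequence[str], healthy: set[str]) -> str:
    """All traffic goes to the highest-priority backend that is still healthy."""
    for backend in priority:
        if backend in healthy:
            return backend
    raise RuntimeError("no healthy backend")
```

With backends "a" and "b", `load_balance` alternates between them under normal conditions, while `fail_over` keeps returning "a" and moves to "b" only when "a" drops out of the healthy set.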

Failover is not Graceful Degradation. Graceful Degradation preserves core function by reducing lower-priority capabilities or service quality. Failover attempts to preserve the protected function by switching to alternate capacity.

Failover is not Fail-Safe. Fail-safe behavior moves a system to a harmless or minimal-impact state. Failover attempts continuity, not merely safety.

Failover is not Circuit Breaker. Circuit Breaker interrupts or meters flow at a boundary under overload or cascade risk. Failover redirects function or flow to an alternate path.

Failover is not Disaster Recovery in general. Disaster recovery may include restoration, rebuilding, communication, legal response, and business continuity. Failover is the specific archetypal move of switching function to an alternate path.

Failover is not simply manual repair. Repair restores the failed primary. Failover preserves continuity by using something else while the primary is failed, degraded, or unavailable.

Composition

Failover is composed from several lower-level abstractions:

  • Redundancy — Some alternate path, capacity, provider, site, authority, or process must exist.
  • Observability — Primary failure or degradation must be detectable.
  • Health check — The system needs a way to judge whether the primary and alternate are viable.
  • Switching rule — Some rule determines when responsibility transfers.
  • Alternate path — The protected function must have a route other than the primary.
  • State synchronization — Stateful failover requires state, context, authority, or ownership to be transferred or reconciled.
  • Recovery policy — The system must know what happens after failover succeeds or the primary returns.
  • Controlled reentry — Failback or reintegration must avoid recreating instability or corrupting state.

The composition matters. Redundancy without switching is only unused backup. Switching without state integrity can corrupt the system. Detection without a viable alternate only reveals failure. Failover without recovery policy can leave the system in an ambiguous or fragile state.

Mechanism Families

Common mechanism families include:

  • Primary-secondary service failover — A standby service takes over when the primary fails.
  • Active-passive or active-active redundancy — Alternate capacity is kept ready in one of several modes: idle, warm, hot, or actively serving.
  • Database replica promotion — A replica becomes the primary after primary failure.
  • Network route failover — Traffic is rerouted through alternate paths when a route, device, or link fails.
  • Power backup generator or UPS transfer — Electrical load shifts to backup power when primary supply fails.
  • Disaster recovery site activation — Operations move to an alternate site or region after major disruption.
  • Organizational delegation or succession protocol — Authority transfers to an alternate person or role when the primary decision-maker is unavailable.
  • Supply-chain alternate supplier activation — Procurement or production shifts to a backup supplier when the primary supplier fails.
  • Command authority succession — Governance or command responsibility transfers to a designated alternate when the primary authority cannot act.

These mechanisms differ by domain, but they preserve the same intervention logic: protected function continues because responsibility transfers to an alternate path.

Parameter Dimensions

Concrete mechanisms usually require tuning along dimensions such as:

  • Failure detection threshold — What counts as primary failure or unacceptable degradation?
  • Health-check cadence — How often is primary and alternate viability checked?
  • Switchover timeout — How quickly must transfer occur?
  • Replication lag tolerance — How stale may alternate state be?
  • State synchronization mode — Is state synchronized continuously, periodically, lazily, or manually?
  • Active-passive vs. active-active mode — Is the alternate idle, warm, hot, or already active?
  • Failback condition — When should the system return to the primary?
  • Retry policy — When should the system retry the primary before switching?
  • In-flight work policy — What happens to work underway during switchover?
  • Alternate capacity size — How much load can the alternate handle?
  • Authority transfer rule — Who or what has authority after failover?
  • Consistency requirement — Is the system allowed to trade consistency for availability?
  • Testing cadence — How often is failover exercised?

These are parameter dimensions, not the archetype itself.
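
Several of these dimensions interact, which a configuration object can make visible. The sketch below is illustrative only: `FailoverConfig`, its field names, and every default value are assumptions, and the detection estimate ignores probe timeouts and network delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverConfig:
    # All names and defaults are illustrative assumptions.
    failure_threshold: int = 3                # consecutive failed checks before the primary counts as failed
    health_check_interval_s: float = 5.0      # health-check cadence
    switchover_timeout_s: float = 30.0        # maximum time allowed for the transfer
    max_replication_lag_s: float = 10.0       # how stale alternate state may be
    retries_before_switch: int = 2            # retry policy against the primary
    alternate_capacity_fraction: float = 1.0  # share of primary load the alternate can carry
    auto_failback: bool = False               # failback is operator-driven unless enabled

    def detection_time_s(self) -> float:
        """Approximate time from failure to detection under these settings."""
        return self.failure_threshold * self.health_check_interval_s

    def validate(self, tolerated_outage_s: float) -> list[str]:
        """Return detected conflicts; an empty list means none were found."""
        problems = []
        total = self.detection_time_s() + self.switchover_timeout_s
        if total > tolerated_outage_s:
            problems.append(
                f"detection + switchover ({total:.0f}s) exceeds "
                f"tolerated outage ({tolerated_outage_s:.0f}s)"
            )
        if self.alternate_capacity_fraction < 1.0:
            problems.append("alternate cannot carry full primary load")
        return problems
```

With the defaults, detection takes about 15 seconds and switchover up to 30, so `validate(tolerated_outage_s=30)` reports a conflict while `validate(tolerated_outage_s=60)` does not.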

Invariants to Preserve

Failover should preserve explicit invariants:

  • Protected function continues or fails cleanly — The system should not enter ambiguous half-service.
  • State integrity is preserved — Failover should not corrupt, duplicate, or silently lose critical state.
  • Ownership is unambiguous — There should not be two primaries acting independently unless the system is designed for that.
  • In-flight work is handled safely — Work should be completed, retried, rolled back, or cleanly rejected.
  • Trigger is observable and auditable — Operators or mechanisms should know why failover occurred.
  • Alternate readiness is maintained — Backup capacity should be sufficiently current, tested, and reachable.
  • Recovery does not corrupt state — Returning to the primary should not overwrite, duplicate, or fork state.

If these invariants cannot be preserved, failover may be more dangerous than outage.
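
One common way to keep ownership unambiguous in software systems is a fencing token: each new owner receives a strictly larger epoch number, and the shared resource rejects writes that carry an older one. This technique is not prescribed by the entry; the sketch below is a minimal in-memory illustration with hypothetical names (`FencedStore`, `grant_ownership`, `write`).

```python
class FencedStore:
    """Storage that accepts writes only from the holder of the newest epoch."""

    def __init__(self) -> None:
        self.current_epoch = 0
        self.data: dict[str, str] = {}

    def grant_ownership(self) -> int:
        """Issue a new, strictly larger epoch to whoever becomes primary."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> bool:
        """Apply the write only if the caller holds the newest epoch."""
        if epoch != self.current_epoch:
            return False  # stale owner is fenced off
        self.data[key] = value
        return True
```

If the old primary comes back after a failover and still believes it owns the function, its writes carry the old epoch and are refused, so the two paths cannot both modify state.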

Tradeoffs

Failover accepts preparation, duplication, and coordination costs in order to preserve continuity.

Typical tradeoffs include:

  • Duplicated capacity cost because alternate capacity must exist.
  • Synchronization overhead because state, data, context, or authority may need to be kept current.
  • Switchover complexity because failure detection and transfer rules must be reliable.
  • Split-brain risk when primary and alternate both believe they own the function.
  • Stale-state risk if the alternate is not fully current.
  • Underused backup resources when standby capacity sits idle.
  • Testing and maintenance burden because failover mechanisms must be exercised.
  • False failover risk if the system switches unnecessarily.
  • Failback complexity when returning to the original primary.

The archetype is therefore a continuity strategy, not a free resilience upgrade.

Contraindications

Failover is a poor fit when the alternate path cannot safely assume the protected function.

Use cautiously or avoid when:

  • no viable alternate capacity exists,
  • the alternate shares the same failure domain as the primary,
  • state cannot be transferred or reconciled safely,
  • the switchover delay exceeds the tolerated outage window,
  • false failover would be more damaging than primary degradation,
  • split-brain or dual authority would be catastrophic,
  • backup capacity is untested or stale,
  • the true problem is excess demand rather than primary unavailability,
  • the system needs graceful degradation, load shedding, or backpressure more than alternate activation.

In such cases, other archetypes may be better: graceful degradation, circuit breaker, bulkhead isolation, load shedding, backpressure, buffering, repair, or redesign.

Failure Modes

Common failure modes include:

  • False failover — The system switches away from a viable primary due to misleading signals.
  • Failover loop — The system repeatedly switches between paths, creating instability.
  • Split-brain — Primary and alternate both act as authoritative, causing conflict or corruption.
  • Stale backup state — The alternate lacks current state and produces incorrect behavior.
  • Shared failure domain — The alternate fails for the same reason as the primary.
  • Untested standby — Backup capacity exists on paper but fails when activated.
  • Partial switchover — Some flows move to the alternate while others remain on the failed primary.
  • Lost in-flight work — Work underway at the moment of failure is duplicated, dropped, or corrupted.
  • Failback corruption — Returning to the primary overwrites or conflicts with state created during failover.
  • Backup capacity exhaustion — The alternate cannot handle the transferred load.
  • Ambiguous ownership — Actors do not know who or what is responsible after transfer.
  • Cascading failover — Load transferred during failover overwhelms the alternate, and the resulting failure spreads to further components.

These failure modes should be treated as part of the archetype's design space.
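
False failover and failover loops are often reduced by damping the health signal, so that a single bad or good check does not flip the decision. The sketch below assumes a simple consecutive-count rule with a stricter threshold for recovery than for failure; `DampedDetector` and its thresholds are hypothetical, and real systems may use time windows or other rules instead.

```python
from dataclasses import dataclass


@dataclass
class DampedDetector:
    fail_after: int = 3      # consecutive failed checks needed to declare failure
    recover_after: int = 5   # consecutive passed checks needed to declare recovery
    primary_ok: bool = True  # current damped view of primary health
    _streak: int = 0         # consecutive results that disagree with the current view

    def observe(self, check_passed: bool) -> bool:
        """Feed one health-check result; return the damped view of primary health."""
        if check_passed == self.primary_ok:
            self._streak = 0  # result agrees with the current view, reset the counter
            return self.primary_ok
        self._streak += 1
        needed = self.recover_after if check_passed else self.fail_after
        if self._streak >= needed:
            self.primary_ok = check_passed
            self._streak = 0
        return self.primary_ok
```

Requiring more evidence to return than to leave is a deliberate asymmetry in this sketch: it slows failback slightly in exchange for fewer oscillations when the primary is flapping.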

Worked Example

An online service runs a primary database in one region and maintains a replicated standby database in another region. Under normal conditions, all writes go to the primary. The standby receives replicated updates but does not serve production traffic.

During an outage, the primary region becomes unreachable. If the service simply waits, users cannot complete transactions. The team initiates failover.

  • Health checks confirm the primary is unavailable.
  • The standby database is promoted to primary.
  • Application traffic is redirected to the new primary.
  • In-flight transactions are retried or reconciled according to policy.
  • Operators verify that ownership is unambiguous.
  • Once the original region recovers, the team decides whether to fail back or keep the new primary.

The intervention succeeds because the alternate was prepared, primary failure was detected, authority transferred clearly, and state integrity was preserved well enough for the service's requirements.

The key move is not merely having a replica. It is switching the protected function to that replica when the primary fails.
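
The promotion decision in this example can be sketched in a few lines. Everything here is a hypothetical model, not a real database API: `Database`, `Service`, `last_acked_txn`, and `max_lag_txns` are invented names, replication lag is counted in transactions, and marking the old primary as "fenced" stands in for whatever mechanism prevents it from accepting writes if it returns.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    region: str
    role: str                 # "primary", "standby", or "fenced"
    reachable: bool = True
    applied_txn: int = 0      # last transaction applied locally


@dataclass
class Service:
    primary: Database
    standby: Database
    last_acked_txn: int = 0   # last write the service saw acknowledged
    max_lag_txns: int = 0     # tolerated replication lag, in transactions
    audit: list = field(default_factory=list)

    def fail_over(self) -> bool:
        """Promote the standby if the primary is down and the standby is current enough."""
        if self.primary.reachable:
            self.audit.append("primary reachable, no failover")
            return False
        lag = self.last_acked_txn - self.standby.applied_txn
        if lag > self.max_lag_txns:
            self.audit.append(f"refused: standby is {lag} txns behind")
            return False
        old = self.primary
        old.role = "fenced"               # old primary must not accept writes if it returns
        self.standby.role = "primary"
        self.primary, self.standby = self.standby, old
        self.audit.append(f"promoted {self.primary.region}; {lag} txns to reconcile")
        return True
```

The audit list records why each decision was made, and the lag check refuses promotion when the standby is too far behind for the service's stated tolerance, which leaves the choice between outage and stale state to an explicit policy rather than to chance.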

Cross-Domain Instances

  • Software and distributed systems — Services switch to standby instances or alternate regions when a primary dependency fails.
  • Databases and stateful services — Replicas are promoted when primary databases become unavailable.
  • Networking and routing — Traffic moves to alternate routes when links, routers, or paths fail.
  • Power and infrastructure — Backup generators, UPS systems, or alternate feeds assume load when primary power fails.
  • Disaster recovery and business continuity — Operations shift to alternate sites, systems, or processes after major disruption.
  • Organizational succession — Authority transfers to a designated deputy when the primary decision-maker is unavailable.
  • Supply-chain management — Production or procurement shifts to alternate suppliers when a primary supplier fails.
  • Governance and command authority — Command responsibility transfers through predefined succession rules when primary authority cannot act.

These examples are structurally related because each preserves function by switching from a failed or degraded primary path to a prepared alternate.

Notes

Failover should be reviewed alongside Redundancy, Graceful Degradation, Fail-Safe, Circuit Breaker, Bulkhead Isolation, Load Balancing, and Disaster Recovery.

The main conceptual risk is collapse into nearby concepts:

  • If the entry emphasizes duplicate capacity without activation, it becomes Redundancy.
  • If the entry emphasizes distributing ordinary load, it becomes Load Balancing.
  • If the entry emphasizes reduced capability, it becomes Graceful Degradation.
  • If the entry emphasizes a safe minimal stop, it becomes Fail-Safe.
  • If the entry emphasizes interrupting flow under overload, it becomes Circuit Breaker.
  • If the entry emphasizes broad continuity planning, it may become Disaster Recovery rather than Failover.

The current entry uses health_check, switching_rule, alternate_path, state_synchronization, and recovery_policy as solution-side labels. These may need later normalization as lower-level archetypal components, prime abstractions, or informal component labels.