Common-Mode Failure Analysis

Identify shared dependencies that could cause supposedly independent backups or safeguards to fail together.

Essence

Common-Mode Failure Analysis protects a system from false independence. It asks whether backups, alternate suppliers, safety controls, redundant channels, governance checks, or substitute teams can fail together because they share a hidden cause.

The archetype matters whenever resilience claims rest on the existence of more than one path. A second path is not automatically independent. It may share the same power source, credentials, supplier, code base, physical environment, operating staff, design assumption, funding authority, political constraint, or recovery procedure. The intervention is to expose those shared causes before a real disruption exposes them.

Compression statement

When redundancy appears protective but the redundant elements share hidden dependencies, environments, design assumptions, authorities, or operators, analyze common-mode failure so independence claims are tested and the redundancy can be diversified, isolated, hardened, or governed with explicit residual risk.

Canonical formula: redundancy set + independence assumption + shared dependency map + common failure mode map + mitigation plan + validation probe -> redundancy that is credible against correlated failure

When to Use This Archetype

Use this archetype when a system says, in effect, “we have a backup,” “we have two suppliers,” “we have multiple channels,” “we have separate safeguards,” or “we have redundant controls,” and the consequences of simultaneous failure would matter.

It is especially relevant when independence has been assumed rather than tested. The trigger is not merely the presence of risk; it is the combination of redundancy, an independence claim, and a plausible shared cause that could defeat the redundancy. It is also useful after incidents where several protections disappeared together, because those incidents often reveal a common-mode structure that was invisible in normal operation.

Structural Problem

The structural problem is that nominal redundancy can hide a single point of failure. A system may have many apparent backups, but those backups can be coupled through a shared dependency. That coupling can be technical, physical, institutional, economic, social, or cognitive.

For example, two data centers may be in different places but rely on the same identity provider. Two suppliers may have different names but source the same subcomponent. Two clinical workflows may appear separate but require the same record system login. Several institutional safeguards may be formally independent but rely on the same budget authority or information pipeline.

The deeper tension is between efficiency and independence. Standard platforms, common vendors, common training, centralized authority, and shared infrastructure make systems easier to manage. They also create the possibility that one cause will defeat many protections at once.

Intervention Logic

The intervention begins by naming the protected function. The question is not just “what parts do we have?” but “what function must survive?” Once that function is clear, the redundancy set can be named: the components, channels, suppliers, controls, teams, sites, data sources, or pathways expected to preserve the function.

The next step is to state the independence assumption. What kinds of failures are these elements supposed not to share? Power loss, market shock, operator error, software defect, weather exposure, credential outage, funding cutoff, public panic, misinformation, political pressure, or supplier collapse can all be common causes.

Then the analysis maps shared dependencies and derives the common failure modes they make possible. The output is not just a map; it is a decision. The system may need more diversity, stronger isolation, independent authority, different design lineage, separate recovery access, better monitoring, or explicit residual-risk acceptance. Finally, the independence claim should be probed under stress. A backup that has only been tested in normal conditions may not be a backup under the condition that matters.
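The mapping step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the graph `DEPENDS_ON`, the element names, and both function names are hypothetical. It walks each redundant element's dependencies transitively and reports what every pair has in common.

```python
from itertools import combinations

# Hypothetical dependency graph: each key depends directly on the listed items.
DEPENDS_ON = {
    "dc_east": ["power_grid_a", "identity_provider"],
    "dc_west": ["power_grid_b", "identity_provider"],
    "identity_provider": ["cloud_control_plane"],
}

def transitive_dependencies(element, graph):
    """Return every direct and indirect dependency of `element`."""
    seen = set()
    stack = list(graph.get(element, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the `seen` check also guards against cycles
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

def shared_dependency_map(redundancy_set, graph):
    """Map each pair of redundant elements to the dependencies they share."""
    closures = {e: transitive_dependencies(e, graph) for e in redundancy_set}
    return {
        (a, b): closures[a] & closures[b]
        for a, b in combinations(sorted(redundancy_set), 2)
    }

shared = shared_dependency_map({"dc_east", "dc_west"}, DEPENDS_ON)
for pair, deps in shared.items():
    print(pair, sorted(deps))
```

In this toy graph the two data centers have separate power but share the identity provider and, through it, the cloud control plane. A non-empty intersection is a prompt for the common failure mode map, not yet a verdict.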

Key Components

Common-Mode Failure Analysis is organized around testing the independence claim that a system implicitly makes when it relies on more than one path for a critical function. The Protected Function Set names what must survive, so the review judges viability rather than parts inventory. The Redundancy Set lists the elements being treated as substitutes (servers, suppliers, controls, teams, channels) and defines the scope of the independence claim. The Independence Assumption makes that claim explicit and falsifiable: what kinds of failure are these elements supposed not to share? Together these three components frame the question; without them the analysis devolves into a generic FMEA.

The diagnostic core then exposes how the claim could fail. The Shared Dependency Map traces upstream resources, environments, credentials, vendors, authorities, staff, and procedures that supposedly separate paths actually share. The Common Failure Mode Map converts those shared dependencies into plausible scenarios where one cause defeats multiple protections at once. The Correlated Exposure Register records what would be lost, how severely, who owns the mitigation, and what residual risk remains, turning analysis into accountable resilience work rather than a one-time discovery exercise.

The remaining components close the loop from diagnosis to action and verification. The Diversity or Isolation Requirement specifies what counts as enough separation, so the standard for a credible backup is explicit rather than aspirational. The Mitigation Plan chooses among diversification, isolation, hardening, alternate authority, or explicit residual-risk acceptance, matching the response to the exposure. Finally, the Independence Validation Probe tests the backup under loss of the shared dependency, not in normal conditions where false reassurance is easiest, so the independence claim is challenged by the condition that would actually defeat it.

Component	Description
Protected Function Set	The protected function set identifies what must survive: a service, flow, safeguard, decision, care process, communication channel, or institutional obligation. Without this component, the review can become a generic inventory of parts rather than a test of whether the important function remains viable.
Redundancy Set	The redundancy set names the elements that are being treated as substitutes or backups for one another. These might be servers, suppliers, teams, warehouses, clinics, approval routes, data sources, safety controls, or public communication channels. This component defines the scope of the independence claim.
Independence Assumption	The independence assumption makes explicit the belief that the redundant elements will not fail for the same reason. It is the central claim under test. In many failures, the assumption was never written down, so no one noticed that the backup depended on the same condition as the primary path.
Shared Dependency Map	The shared dependency map reveals upstream resources, environments, designs, credentials, vendors, authorities, staff, data, or procedures shared by supposedly separate paths. It is often the most important discovery tool because common-mode risks frequently sit outside the formal boundary of the component being reviewed.
Common Failure Mode Map	The common failure mode map connects shared dependencies to plausible failure scenarios. A shared dependency is not automatically catastrophic; it becomes a common-mode concern when it can defeat multiple protections under a condition that matters.
Correlated Exposure Register	The correlated exposure register records which redundant elements are exposed together, which functions would be lost, how severe the loss would be, who owns the mitigation, and what residual risk remains. It turns the analysis into accountable resilience work.
Diversity or Isolation Requirement	The diversity or isolation requirement specifies what counts as enough separation. That may mean geographic separation, different vendor lineage, separate credentials, independent recovery authority, different skills, alternate media, compartmentalized infrastructure, or different design assumptions.
Mitigation Plan	The mitigation plan defines how to reduce the common-mode exposure. Mitigation can include diversifying suppliers, isolating control planes, hardening a shared dependency, adding alternate authority, redesigning backup activation, or accepting residual risk explicitly when mitigation is not justified.
Independence Validation Probe	The independence validation probe tests whether the claimed backup or safeguard remains available when a shared dependency fails. It is not enough to verify that the backup works in normal conditions; the probe must challenge the condition that could disable multiple paths at once.

Common Mechanisms

Mechanism	Description
Common-Cause FMEA	Common-cause FMEA adapts failure mode analysis to ask which single causes could defeat multiple protections. It is a mechanism, not the archetype itself. It implements the archetype when it leads to independence testing, mitigation, and residual-risk decisions.
Fault Tree with Common-Cause Branching	A fault tree can represent shared causal branches that lead to multiple failures. This is useful when teams need to see how a single upstream condition can pass through several apparently separate paths.
Dependency Mapping Workshop	A dependency mapping workshop brings together people who see different parts of the system. Common-mode risks often cross team boundaries, so no single operator, supplier manager, engineer, or policy owner can see the whole exposure alone.
Backup Independence Test	A backup independence test exercises a backup under a shared dependency outage. The point is not merely to turn the backup on. The point is to see whether it still works without the same network, credential, supplier, authority, facility, fuel, or staff that the primary path requires.
Supply-Chain Dependency Review	A supply-chain dependency review traces alternate suppliers down to shared sub-tier suppliers, shipping routes, regional exposures, labor constraints, and regulatory chokepoints. It prevents “two suppliers” from becoming a misleading substitute for genuine independence.
Diverse Vendor Review	A diverse vendor review checks whether vendor diversity is real across infrastructure, ownership, code lineage, support, credentials, and failure response. It is useful only when it tests shared exposure rather than merely counts vendor names.
Correlated Risk Register	A correlated risk register records common-mode exposures, owners, mitigation status, test evidence, and residual-risk acceptance. It helps keep the analysis alive after the initial review.
Tabletop Cascade Exercise	A tabletop cascade exercise simulates a shared failure cause and follows how redundant paths respond together. It can reveal timing, authority, communication, and human coordination dependencies that static maps miss.
Credential and Infrastructure Dependency Audit	This audit checks whether emergency systems, backups, and alternate teams still depend on the same identity provider, cloud control plane, network, physical access system, or infrastructure service. It is especially important in digital and organizational systems.
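As a paper version of the Backup Independence Test, the sketch below removes one shared dependency from a model and checks which paths could still operate. It is only an illustration under stated assumptions: `PATH_REQUIRES`, the path names, and the two functions are hypothetical, and a real test would exercise the live backup rather than a dictionary.

```python
# Hypothetical requirements: each path needs ALL of its listed dependencies.
PATH_REQUIRES = {
    "primary_login": {"network", "identity_provider"},
    "backup_login": {"network_backup", "identity_provider"},
    "break_glass_login": {"network_backup", "offline_token"},
}

def surviving_paths(path_requires, failed):
    """Return the paths that need none of the failed dependencies."""
    return sorted(
        path for path, needs in path_requires.items() if not (needs & failed)
    )

def independence_probe(path_requires, shared_dependency):
    """Report whether any path remains once `shared_dependency` is lost."""
    remaining = surviving_paths(path_requires, {shared_dependency})
    return {
        "failed": shared_dependency,
        "remaining": remaining,
        "passes": bool(remaining),
    }

result = independence_probe(PATH_REQUIRES, "identity_provider")
print(result)
```

Here the primary and backup logins both disappear when the identity provider fails, and only the break-glass path survives. Without that third path the probe would fail, even though a backup nominally exists.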

Parameter / Tuning Dimensions

The first tuning dimension is scope. A common-mode review can focus on one protected function, one backup pair, one supplier category, one facility, one control family, or an entire operational ecosystem. Narrow scope is easier to act on; broad scope better reveals systemic coupling.

The second dimension is dependency depth. Looking only one layer upstream may miss sub-tier suppliers, hidden credentials, design lineage, or shared authority. Looking too deeply can make the analysis unmanageable. The right depth depends on consequence severity and practical decision value.
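The effect of dependency depth can be shown with a depth-limited traversal. This is a minimal sketch: the `UPSTREAM` graph and the function names are hypothetical, chosen to mirror the sub-tier supplier case. Two suppliers look independent one layer up and share a source two layers up.

```python
# Hypothetical supply graph: each entry points at its own upstream sources.
UPSTREAM = {
    "supplier_a": ["assembler_x"],
    "supplier_b": ["assembler_y"],
    "assembler_x": ["chip_foundry"],
    "assembler_y": ["chip_foundry"],
}

def dependencies_to_depth(element, graph, max_depth):
    """Collect dependencies reachable within `max_depth` hops (breadth-first)."""
    found = set()
    frontier = {element}
    for _ in range(max_depth):
        # Expand one layer upstream, skipping anything already recorded.
        frontier = {d for node in frontier for d in graph.get(node, [])} - found
        found |= frontier
    return found

def shared_at_depth(a, b, graph, max_depth):
    """Dependencies that both elements reach within the same depth limit."""
    return (dependencies_to_depth(a, graph, max_depth)
            & dependencies_to_depth(b, graph, max_depth))

shallow = shared_at_depth("supplier_a", "supplier_b", UPSTREAM, max_depth=1)
deep = shared_at_depth("supplier_a", "supplier_b", UPSTREAM, max_depth=2)
print(sorted(shallow), sorted(deep))
```

At depth 1 the intersection is empty; at depth 2 the shared foundry appears. Treating `max_depth` as an explicit parameter keeps the depth decision visible instead of leaving it implicit in how far the reviewers happened to look.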

The third dimension is independence standard. Some systems only need partial independence; others require strong physical, procedural, vendor, or authority separation. The standard should be explicit enough that a backup can pass or fail it.
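One way to make the standard something a backup can pass or fail is to list the dimensions on which separation is required and compare the elements on each. The sketch below assumes hypothetical attribute names and values; which dimensions belong in the standard is a judgment the review has to make.

```python
# Hypothetical attributes recorded for each redundant element.
ELEMENTS = {
    "backup_a": {"region": "eu-west", "vendor": "VendorOne",
                 "credential_store": "idp-main"},
    "backup_b": {"region": "us-east", "vendor": "VendorTwo",
                 "credential_store": "idp-main"},
}

def independence_violations(a, b, required_separation):
    """List the required dimensions on which two elements are NOT separated.

    A dimension missing from both records also counts as a violation,
    because separation has not been demonstrated for it.
    """
    return [dim for dim in required_separation if a.get(dim) == b.get(dim)]

standard = ["region", "vendor", "credential_store"]
violations = independence_violations(
    ELEMENTS["backup_a"], ELEMENTS["backup_b"], standard
)
passes = not violations
print(violations, passes)
```

The pair differs in region and vendor but fails the standard on credentials. A partial-independence standard would simply list fewer dimensions.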

The fourth dimension is test realism. More realistic tests reveal more, but they also carry more operational risk. A good validation probe is bounded, reversible, and serious enough to challenge the independence assumption.

The fifth dimension is residual risk tolerance. Eliminating every shared cause is impossible. The archetype requires explicit acceptance of residual common-mode risk, not an illusion of perfect separation.

Invariants to Preserve

The first invariant is a credible independence claim. If a system claims that a backup or safeguard is independent, that claim should be backed by dependency analysis and stress validation.

The second invariant is protected function continuity under shared stress. At least one viable path should remain for the function that matters when a plausible common cause disables another path.

The third invariant is visibility of shared dependencies. Hidden dependencies should be made reviewable, especially when they connect several safeguards.

The fourth invariant is traceable mitigation ownership. High-consequence correlated exposures should not remain as anonymous warnings. Someone should own the mitigation, monitoring, test, or residual-risk acceptance.
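The ownership invariant lends itself to a mechanical check over a correlated exposure register. The record layout below is a hypothetical sketch: the field names, the 1 to 5 severity scale, and the threshold of 4 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExposureEntry:
    """One row of a hypothetical correlated exposure register."""
    shared_cause: str
    exposed_elements: List[str]
    functions_lost: List[str]
    severity: int                      # assumed scale: 1 (minor) to 5 (critical)
    owner: Optional[str] = None        # who owns mitigation, test, or acceptance
    residual_risk_accepted: bool = False

def unowned_high_severity(register, threshold=4):
    """Return entries at or above `threshold` severity that have no owner."""
    return [e for e in register if e.severity >= threshold and not e.owner]

register = [
    ExposureEntry("identity provider outage", ["dc_east", "dc_west"],
                  ["operator login"], severity=5),
    ExposureEntry("shared deployment pipeline", ["dc_east", "dc_west"],
                  ["rollback"], severity=5, owner="platform-team"),
    ExposureEntry("shared badge system", ["site_a", "site_b"],
                  ["physical access"], severity=2),
]

for entry in unowned_high_severity(register):
    print("No owner:", entry.shared_cause)
```

Only the identity provider exposure is flagged: the pipeline exposure has an owner, and the badge system falls below the assumed threshold.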

The fifth invariant is avoidance of redundancy theater. The system should not reassure people with duplicate-looking elements that will disappear together under the same condition.

Target Outcomes

A successful common-mode analysis reduces surprise. Teams are less likely to discover during a crisis that all backups depended on the same unavailable condition.

It also improves the value of redundancy investments. Instead of spending resources on duplicate capacity that shares the decisive failure cause, the system can invest in separation, diversity, isolation, or hardening where it matters.

The archetype also improves adjacent resilience patterns. Redundant Backup Provisioning becomes more credible because the backups are tested for independence. Diverse Functional Redundancy becomes more targeted because the required diversity is tied to actual shared exposures. Fault-Tolerant Operation becomes safer because continuation paths are less likely to fail together.

Tradeoffs

The main tradeoff is independence versus efficiency. Shared platforms, common vendors, standard procedures, and centralized authority reduce cost and complexity, but they also concentrate failure exposure.

A second tradeoff is diversity versus maintainability. Different designs, vendors, credentials, and procedures can protect against common-mode failure, but they require more training, testing, integration, and governance.

A third tradeoff is analysis depth versus actionability. A shallow review can miss decisive shared causes; an excessively deep review can become paralyzing. The goal is useful exposure reduction, not exhaustive mapping for its own sake.

A fourth tradeoff is realism versus safety in testing. Stress tests should challenge independence assumptions, but they should not create unmanaged harm.

Failure Modes

The most common failure mode is cosmetic redundancy. The system has two of something, but both depend on the same hidden condition. This is mitigated by dependency mapping and independence validation.

Another failure mode is checklist theater. The review is completed, but no design, sourcing, procedure, test, or residual-risk decision changes. This is mitigated by tying findings to owners and actions.

A third failure mode is hidden design lineage. Redundant elements look separate but share code, model assumptions, procedures, training, or vendor templates. This is mitigated by design-lineage checks and independent validation where needed.

A fourth failure mode is shared recovery dependency. The backup exists, but activation or recovery requires the same network, credentials, staff, fuel, manual, spare part, or authority that failed. This is mitigated by testing recovery under loss of the shared dependency.

A fifth failure mode is over-diversification. The organization diversifies everything, creating unnecessary complexity and new errors. This is mitigated by prioritizing by consequence, likelihood, detectability, and mitigation cost.
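Prioritization by consequence, likelihood, detectability, and mitigation cost can be made concrete with a simple score. The rule below is one arbitrary choice among many, loosely in the spirit of FMEA-style risk ranking; the formula, the 1 to 5 scales, and the example exposures are all assumptions rather than part of the archetype.

```python
# Hypothetical 1-5 scores; the scoring rule itself is an illustrative assumption.
EXPOSURES = [
    {"name": "shared_identity_provider", "consequence": 5, "likelihood": 3,
     "detection_difficulty": 4, "mitigation_cost": 2},
    {"name": "shared_hvac_vendor", "consequence": 2, "likelihood": 2,
     "detection_difficulty": 2, "mitigation_cost": 4},
    {"name": "shared_sub_tier_supplier", "consequence": 4, "likelihood": 2,
     "detection_difficulty": 5, "mitigation_cost": 5},
]

def priority(exposure):
    """Risk (consequence x likelihood x detection difficulty) per unit of cost."""
    risk = (exposure["consequence"]
            * exposure["likelihood"]
            * exposure["detection_difficulty"])
    return risk / exposure["mitigation_cost"]

ranked = sorted(EXPOSURES, key=priority, reverse=True)
for exposure in ranked:
    print(exposure["name"], priority(exposure))
```

With these scores the shared identity provider ranks first and the shared HVAC vendor last, which suggests accepting residual risk on the latter instead of diversifying it. The ranking is only as good as the scores that feed it.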

Neighbor Distinctions

Common-Mode Failure Analysis is distinct from Redundant Backup Provisioning. Provisioning adds backup capacity; common-mode analysis checks whether that capacity is genuinely independent against relevant failure causes.

It is distinct from Diverse Functional Redundancy. Diverse Functional Redundancy designs multiple ways to fulfill the same function; common-mode analysis determines whether existing or proposed ways share a failure cause and what diversity or isolation is needed.

It is distinct from Fault-Tolerant Operation. Fault tolerance keeps operating under partial failure; common-mode analysis reduces the chance that all continuation paths fail together.

It is distinct from Dependency Exposure. Dependency Exposure is a broader pattern for revealing hidden dependencies. Common-Mode Failure Analysis focuses specifically on dependencies that undermine redundancy, backup, or safeguard independence.

It is distinct from FMEA. FMEA is a method family and can be one mechanism used here. The archetype is the broader cross-domain intervention of challenging independence assumptions and mitigating correlated failure.

Cross-Domain Examples

In cloud operations, redundant regions may share one identity provider or deployment pipeline. Common-mode analysis asks whether emergency access, rollback, and recovery work when that shared service is unavailable.

In supply chains, two suppliers may share a sub-tier supplier, logistics route, region, or regulatory chokepoint. Common-mode analysis tests whether the alternate source is actually independent under the relevant shock.

In healthcare, backup workflows may depend on the same electronic records system or authorization path as the primary workflow. Common-mode analysis asks whether patient care can continue when that shared dependency fails.

In finance, hedges that look different in normal conditions may all depend on the same liquidity, counterparty network, or model assumption during crisis. Common-mode analysis looks for correlation under stress, not just surface diversity.

In public governance, multiple oversight bodies may be formally independent but share budget authority, appointment channels, or information sources. Common-mode analysis asks whether institutional safeguards can fail together.

Non-Examples

A single fragile component with no backup is not a case for Common-Mode Failure Analysis. It calls first for backup provisioning or robustness improvement.

A backup that fails because it was never maintained is not necessarily a common-mode failure. It becomes a common-mode concern if the same maintenance dependency can disable several backups together.

A standard FMEA table listing isolated component failures is not the archetype. It is a possible mechanism only when it includes shared causes, correlated exposure, mitigation, and validation.

A general supplier inventory is not the archetype unless it evaluates which shared dependencies can cause supposedly independent suppliers to fail together.

Abstractions this archetype builds on, either directly (as a source ingredient) or as a related pattern. Links follow the typed catalog namespace.

Built directly on (3)

Coupling: Interdependence among subsystems.
Functional Redundancy (Degeneracy): Multiple pathways fulfill same function.
Redundancy: Duplicate critical components.

Also references 9 related abstractions

Boundary: Defines system limits.
Failure Mode and Effects Analysis (FMEA): Identify failure modes.
Fault Tolerance: Continue operating under failure.
Observability: Infer internal state externally.
Relation: Describes associations or dependencies.
Resilience: Absorb shocks and adapt.
Risk Aversion: Preference for certainty.
Robustness: Maintain functionality under stress.
Uncertainty: Incomplete knowledge.

Variants

Narrower or domain-specific specializations that share this archetype's core structure. Recognized variants are established; candidate variants are provisional.

Common-Cause Failure Analysis · risk or failure variant · recognized

A causal form of common-mode review that asks which single cause could defeat multiple redundant paths or safeguards.

Distinct from parent: The parent includes dependency, lineage, environment, and correlation analysis; this variant emphasizes common causes as the organizing lens.
Use when: A single initiating event, defect, operator action, environmental condition, or governance failure could remove multiple protections; Teams need a cause-centered analysis rather than a dependency inventory alone.
Typical domains: safety engineering, cloud operations, public infrastructure
Common mechanisms: Common-Cause FMEA, Fault Tree with Common-Cause Branching

Backup Independence Analysis · implementation variant · recognized

A backup-focused variant that tests whether alternate capacity, suppliers, records, channels, or roles remain available under the same stress that disables the primary path.

Distinct from parent: The parent can evaluate any redundant safeguards; this variant specifically targets backup paths and standby resources.
Use when: The system already has backup resources and the question is whether they are truly independent; Recovery plans assume alternates will be available but have not tested shared dependencies.
Typical domains: software reliability, hospital continuity, records management, emergency communications
Common mechanisms: Backup Independence Test, Tabletop Cascade Exercise

Shared Dependency Review · implementation variant · candidate

A dependency-centered review that looks for upstream resources, controls, or conditions shared by multiple protections.

Distinct from parent: The parent includes mitigation and validation; this variant is the mapping-heavy part of the intervention.
Use when: The same dependency may sit behind several supposedly separate paths; The immediate need is visibility before deciding on diversification, isolation, or residual-risk acceptance.
Typical domains: infrastructure, software operations, governance, supply chains
Common mechanisms: Dependency Mapping Workshop, Credential and Infrastructure Dependency Audit

Correlated Risk Analysis · risk or failure variant · candidate

A quantitative or portfolio-style variant that analyzes whether risks assumed to offset or diversify one another actually fail together under stress.

Distinct from parent: The parent is broader and need not be quantitative; this variant is useful where portfolio or correlated-risk vocabulary dominates.
Use when: Multiple hedges, safeguards, or resources are expected to diversify risk; The key concern is correlation under stress rather than ordinary failure rate.
Typical domains: finance, supply chains, emergency planning, infrastructure portfolios
Common mechanisms: Correlated Risk Register
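For the quantitative flavor of this variant, one common simplification from reliability engineering is the beta-factor model, which the text above does not name and which is used here purely as an illustration. It assumes a fraction `beta` of each element's failure probability `q` comes from a cause that takes out both elements at once. The function name and the numbers are hypothetical.

```python
def pair_failure_probability(q, beta):
    """Probability that a redundant pair fails, under a beta-factor assumption.

    q    : failure probability of each element on its own (assumed equal)
    beta : fraction of q attributed to a shared cause that defeats both
    The shared-cause event and the independent failures are assumed to be
    statistically independent of one another.
    """
    q_common = beta * q                 # one event, both elements lost
    q_independent = (1 - beta) * q      # per-element independent share
    both_independent = q_independent ** 2
    # The pair fails if the shared cause occurs OR both fail independently.
    return 1 - (1 - q_common) * (1 - both_independent)

naive = pair_failure_probability(0.01, beta=0.0)
coupled = pair_failure_probability(0.01, beta=0.1)
print(f"independent: {naive:.6f}  with shared cause: {coupled:.6f}")
```

Under these assumed numbers, attributing only a tenth of each element's failures to a shared cause makes the pair more than ten times as likely to fail as the independent calculation suggests. That gap is what surface diversity hides.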

Near names: common-mode failure review, common-cause failure review, common-cause failure analysis, false redundancy analysis, backup independence testing, shared dependency failure analysis.