Intermittent Failure Capture

Essence

Intermittent Failure Capture is the pattern of making elusive failures diagnosable by preserving evidence while the failure is happening. The target problem is not simply that a system fails. The target problem is that the system fails irregularly, then returns to normal before the relevant state, context, sequence, or witness detail can be inspected.

The archetype works by turning a temporary episode into a durable episode record. A trigger condition opens a capture window; snapshots, traces, logs, and contextual details are preserved; the evidence is protected from ordinary cleanup; and a follow-up diagnostic path turns the record into repair, prevention, or better instrumentation.

Compression statement

When failures occur intermittently and vanish before ordinary inspection, define episode triggers, capture windows, state snapshots, contextual traces, evidence preservation rules, recurrence analysis, and follow-up diagnostic paths so rare or transient failures become analyzable rather than anecdotal.

Canonical formula: intermittent_episode + trigger_condition + capture_window + state_snapshot + contextual_trace + evidence_preservation + follow_up_analysis -> diagnosable_failure_record
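
One way to read the formula is as the field list of a single record type. The sketch below is illustrative only: the class name EpisodeRecord, its field names, and the is_diagnosable check are assumptions made for this example, not part of the archetype's definition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


@dataclass
class EpisodeRecord:
    """One durable record per episode (hypothetical field names)."""
    episode_id: str
    trigger_condition: str                      # what opened the capture
    window_start: datetime                      # capture window bounds
    window_end: datetime
    state_snapshot: Dict[str, Any] = field(default_factory=dict)
    contextual_trace: Dict[str, Any] = field(default_factory=dict)
    preserved: bool = False                     # exempt from ordinary cleanup
    follow_up_owner: Optional[str] = None       # who reviews this record

    def is_diagnosable(self) -> bool:
        """True only when every term of the formula is present."""
        return bool(
            self.trigger_condition
            and self.window_start <= self.window_end
            and self.state_snapshot
            and self.contextual_trace
            and self.preserved
            and self.follow_up_owner
        )


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = EpisodeRecord(
    episode_id="ep-001",
    trigger_condition="p99 latency above 2 s",
    window_start=start,
    window_end=start + timedelta(seconds=90),
)
print(record.is_diagnosable())  # False: no snapshot, context, hold, or owner yet
```

A record that knows an episode happened but lacks a snapshot, context, preservation, or an owner stays on the anecdotal side of the arrow.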

When to Use This Archetype

Use this archetype when failures or symptoms are real but hard to reproduce, when ordinary inspection arrives after the episode has disappeared, and when diagnosis depends on transient state or context. It is especially useful when continuous full monitoring is too costly, too intrusive, or too noisy, but targeted capture during episodes is feasible.

Good fits include intermittent production bugs, episodic health symptoms, rare equipment defects, safety near misses, field events, and organizational process breakdowns that look normal during scheduled reviews. The pattern is weaker when the failure is continuously visible, when safe reproduction is easy, or when useful capture would violate privacy, consent, safety, or proportionality constraints.

Structural Problem

The structural problem is an evidence timing mismatch. Diagnosis needs information from the failure state, but the failure state is intermittent and self-erasing. Logs rotate, systems reset, symptoms fade, witnesses forget, cleanup routines remove evidence, and people reinterpret episodes after the fact.

This creates recurring root-cause ambiguity. Each episode is treated as isolated or anecdotal because the common sequence across episodes is unavailable. The system may accumulate reports without accumulating diagnosis.

Intervention Logic

The intervention is to design a capture loop around the episode itself. First, define what counts as the failure and what signal should trigger capture. Next, decide what state, context, sequence, and observations would make the episode explainable. Then create a capture window that preserves evidence before, during, and after the visible symptom.
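
The capture window described above can be sketched with a rolling buffer. This is a minimal illustration, assuming a stream of numeric samples and a caller-supplied trigger function; the class name CaptureWindow, its parameters, and the sample values are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class CaptureWindow:
    """Keeps a rolling pre-trigger buffer and records a fixed number of
    samples after the symptom clears. Names and sizes are illustrative."""

    def __init__(self, trigger: Callable[[float], bool],
                 pre_samples: int = 5, post_samples: int = 3) -> None:
        if post_samples < 1:
            raise ValueError("post_samples must be at least 1")
        self.trigger = trigger
        self.post_samples = post_samples
        self._recent: Deque[float] = deque(maxlen=pre_samples)
        self._active: Optional[List[float]] = None   # open episode, if any
        self._remaining = 0
        self.episodes: List[List[float]] = []        # completed captures

    def observe(self, sample: float) -> None:
        if self._active is None:
            if self.trigger(sample):
                # Open the window: keep what led up to the symptom.
                self._active = list(self._recent) + [sample]
                self._remaining = self.post_samples
            self._recent.append(sample)
            return
        self._active.append(sample)
        if self.trigger(sample):
            # Symptom still present: restart the post-event countdown.
            self._remaining = self.post_samples
        else:
            self._remaining -= 1
        if self._remaining == 0:
            # Close the window and start the pre-event buffer afresh.
            self.episodes.append(self._active)
            self._active = None
            self._recent.clear()


window = CaptureWindow(trigger=lambda x: x > 5.0, pre_samples=2, post_samples=2)
for value in [0.1, 0.2, 0.3, 0.4, 9.0, 9.5, 0.5, 0.6, 0.7]:
    window.observe(value)
print(window.episodes)  # [[0.3, 0.4, 9.0, 9.5, 0.5, 0.6]]
```

The captured list holds samples from before, during, and after the visible symptom, which is the point of the window: the two readings before 9.0 would have been lost if recording had started only at the trigger.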

Once evidence is captured, it must be protected. If normal recovery overwrites logs, if records are detached from their episode, or if no one owns review, the pattern fails. The final step is recurrence analysis: compare captured episodes and turn the findings into corrective action, safer reproduction, prevention, or better capture rules.
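
As a hedged sketch of the protection step, the function below shows a log rotation routine that skips segments attached to an episode still awaiting review. The LogSegment type, the rotate function, and the idea of an open_episodes set are assumptions for illustration; real systems will express the hold differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class LogSegment:
    name: str
    age_hours: float
    episode_id: Optional[str] = None  # set when the segment belongs to a capture


def rotate(segments: List[LogSegment], max_age_hours: float,
           open_episodes: Set[str]) -> List[LogSegment]:
    """Return the segments that survive rotation.

    A segment is kept if it is young enough, or if it is linked to an
    episode whose review is still open (the preservation rule).
    """
    kept = []
    for seg in segments:
        held = seg.episode_id is not None and seg.episode_id in open_episodes
        if held or seg.age_hours <= max_age_hours:
            kept.append(seg)
    return kept


segments = [
    LogSegment("a.log", 48.0, episode_id="ep-7"),  # old, review still open
    LogSegment("b.log", 48.0),                     # old, ordinary log
    LogSegment("c.log", 1.0),                      # recent
    LogSegment("d.log", 48.0, episode_id="ep-3"),  # old, review closed
]
survivors = rotate(segments, max_age_hours=24.0, open_episodes={"ep-7"})
print([s.name for s in survivors])  # ['a.log', 'c.log']
```

Tying the hold to an open review, not to the episode link alone, keeps preservation connected to ownership: once review closes, the evidence returns to ordinary retention.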

Key Components

Intermittent Failure Capture turns self-erasing episodes into preserved evidence packets by designing a capture loop around the failure itself. The Trigger Condition defines the observable sign, threshold, anomaly, or report that opens evidence collection — specific enough to avoid constant noise, sensitive enough to catch episodes before important state is lost. The Capture Window defines how much evidence before, during, and after the visible symptom should be retained, because causes often appear before the failure and recovery evidence often explains how the system returned to normal. Within that window, three components preserve different layers of the episode: the State Snapshot captures internal state such as configuration, queue contents, variable values, and resource levels at the moment of failure; the Contextual Trace records surrounding conditions like timing, workload, dependency state, environmental readings, and recent changes; and the Event Log reconstructs the chronological sequence of signals, actions, and transitions so investigators do not have to rely on memory.

Three further components ensure that captured evidence becomes diagnosis rather than data hoarding. The Evidence Preservation Rule protects captured material from being overwritten, normalized away, or detached from its episode by ordinary cleanup routines — the difference between a useful record and a temporary diagnostic artifact. Recurrence Analysis compares episodes to find common preconditions, sequences, triggers, or hidden causes, since a single capture may make the problem visible while several captures often make it diagnosable. The Follow-up Diagnostic Path assigns review, escalation, reproduction boundaries, and corrective action, ensuring captured evidence routes into repair, prevention, or improved capture rules rather than accumulating unread. Together these components let intermittent failures become analyzable instead of anecdotal.

Component | Description
Trigger Condition | Defines the observable sign, threshold, anomaly, or report that starts evidence capture. It must be specific enough to avoid constant noise and sensitive enough to catch episodes before important state is lost.
Capture Window | Defines how much evidence before, during, and after the episode should be retained. The window matters because causes often appear before the visible failure and recovery evidence may explain how the system returned to normal.
State Snapshot | Preserves the internal state at or near the failure moment: configuration, queue state, variable values, resource levels, user/session state, environmental readings, or other domain-specific details.
Contextual Trace | Captures surrounding conditions such as timing, workload, dependency state, user action, medication timing, staffing, weather, or recent changes. The context explains why the episode occurred in this instance.
Event Log | Provides the chronological sequence of signals, actions, transitions, and observations. It lets investigators reconstruct the episode rather than relying on memory.
Evidence Preservation Rule | Prevents captured evidence from being overwritten, normalized away, discarded, or altered before analysis. It is the difference between a useful episode record and a temporary diagnostic artifact.
Recurrence Analysis | Compares episodes to find common preconditions, sequences, triggers, or hidden causes. A single capture may make the problem visible; several captures often make it diagnosable.
Follow-up Diagnostic Path | Assigns review, escalation, reproduction boundaries, and corrective action. Without this path, capture becomes data hoarding rather than diagnosis.
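
Recurrence Analysis can be sketched as a simple frequency count over the contextual traces of several captured episodes. The function name common_conditions, the min_share threshold, and the example context keys are hypothetical, and the approach assumes each trace is a flat mapping of condition name to value.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def common_conditions(episodes: Iterable[Dict[str, str]],
                      min_share: float = 0.75) -> List[Tuple[str, str, float]]:
    """Return (key, value, share) for context values seen in at least
    min_share of the episodes, most frequent first."""
    episodes = list(episodes)
    if not episodes:
        return []
    counts = Counter(
        (key, value) for ep in episodes for key, value in ep.items()
    )
    total = len(episodes)
    hits = [
        (key, value, n / total)
        for (key, value), n in counts.items()
        if n / total >= min_share
    ]
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


captured = [
    {"deploy": "v2", "region": "eu", "cache": "cold"},
    {"deploy": "v2", "region": "us", "cache": "cold"},
    {"deploy": "v1", "region": "eu", "cache": "cold"},
    {"deploy": "v2", "region": "ap", "cache": "cold"},
]
print(common_conditions(captured))
# [('cache', 'cold', 1.0), ('deploy', 'v2', 0.75)]
```

Shared conditions found this way are candidates, not causes. A value that appears in every episode may also appear in most normal operation, so the result is only a starting point for comparison against a baseline, especially when few episodes have been captured.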

Common Mechanisms

Mechanisms implement the archetype; they are not the archetype itself. A Flight Recorder or Black Box Log maintains recent state so evidence survives the episode. A Trigger-Based Debug Trace or Automatic Diagnostic Capture turns on richer evidence collection when a symptom or threshold fires. An Incident Snapshot packages state, context, and logs into one preserved record. A Symptom Diary lets a person capture episodic human experience when sensors are unavailable. A Rare Event Monitor stays ready for low-frequency episodes, and a Post-Episode Evidence Review turns captured material into analysis while context is still fresh.

The mechanism choice depends on episode frequency, diagnostic need, privacy risk, cost, and whether evidence is best captured by automation, resilient recording, human reporting, or review workflow.

Parameter / Tuning Dimensions

Important tuning dimensions include trigger sensitivity, trigger specificity, pre-event buffer length, post-event capture duration, snapshot depth, contextual breadth, storage retention, access control, review urgency, false-positive tolerance, and capture overhead.

The central tuning problem is balancing diagnostic power against cost and intrusion. More evidence can reveal causes, but broad capture can create privacy risk, analysis burden, performance impact, and false certainty. The best design captures enough to distinguish the episode from normal operation while minimizing unnecessary exposure.
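
A rough storage estimate makes the tradeoff concrete. The sketch below assumes evidence volume is dominated by a fixed-rate trace plus one snapshot per episode, and it ignores the duration of the episode itself; the CaptureTuning class, its fields, and the numbers are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureTuning:
    pre_event_seconds: float          # pre-event buffer length
    post_event_seconds: float         # post-event capture duration
    samples_per_second: float
    bytes_per_sample: int
    snapshot_bytes: int               # snapshot depth, expressed as size
    expected_triggers_per_day: float  # driven by trigger sensitivity
    retention_days: int               # storage retention

    def bytes_per_episode(self) -> float:
        window = self.pre_event_seconds + self.post_event_seconds
        trace = window * self.samples_per_second * self.bytes_per_sample
        return trace + self.snapshot_bytes

    def retained_bytes(self) -> float:
        return (self.bytes_per_episode()
                * self.expected_triggers_per_day
                * self.retention_days)


tuning = CaptureTuning(
    pre_event_seconds=30, post_event_seconds=10,
    samples_per_second=100, bytes_per_sample=200,
    snapshot_bytes=1_000_000,
    expected_triggers_per_day=5, retention_days=30,
)
print(tuning.bytes_per_episode())  # 1800000 bytes, about 1.8 MB per episode
print(tuning.retained_bytes())     # 270000000 bytes, about 270 MB retained
```

Under these assumptions, doubling trigger sensitivity so that ten captures fire per day doubles retained storage and review load, while doubling the pre-event buffer adds only 0.6 MB per episode. The same arithmetic can be repeated for review time or privacy exposure per captured episode.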

Invariants to Preserve

The first invariant is that episode evidence must survive beyond the episode. The second is that state, context, timing, and observations must remain linked as one record. The third is that ordinary operation, recovery, and human safety must not be compromised by the capture setup.

A fourth invariant is that capture must remain tied to diagnosis. The intervention is not justified by accumulation alone; it should support review, learning, repair, prevention, or a decision to retire the capture mechanism.

Target Outcomes

The desired outcome is diagnosable failure episodes. Instead of vague reports that a problem “happened again,” the system produces preserved evidence: what state it was in, what conditions surrounded it, what sequence led into it, and what changed afterward.

Successful use reduces root-cause ambiguity, shortens the path to corrective action, reveals recurrence patterns, and reduces dependence on unsafe reproduction attempts or excessive continuous monitoring.

Tradeoffs

Intermittent Failure Capture trades simplicity for diagnostic power. It can add instrumentation overhead, storage cost, workflow burden, and governance complexity. It can also create ethical concerns when captured evidence includes sensitive human behavior, personal data, or security details.

The archetype also trades immediacy for accuracy. Capturing and preserving evidence may slow cleanup or require additional review. In safety-critical contexts, evidence collection must never delay emergency response.

Failure Modes

A common failure mode is a trigger that fires too late or too narrowly, missing the relevant state. Another is thin capture: the system records that the episode occurred but omits the context needed to explain it. Capture can also overwhelm review capacity when triggers are too broad or evidence packets are too large.

Other failure modes include privacy violations, instrumentation that changes the behavior being diagnosed, evidence stored without analysis, and overinterpretation of a small number of captured episodes. These can be mitigated through capture budgets, minimization, trigger review, access control, and explicit diagnostic ownership.
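
A capture budget can be sketched as a sliding-window limit on how many captures may open in a given period. The CaptureBudget class and its parameters are hypothetical, and timestamps are passed in explicitly as seconds so the example stays deterministic.

```python
from collections import deque
from typing import Deque


class CaptureBudget:
    """Allows at most max_captures within any sliding period_seconds window."""

    def __init__(self, max_captures: int, period_seconds: float) -> None:
        self.max_captures = max_captures
        self.period_seconds = period_seconds
        self._granted: Deque[float] = deque()  # timestamps of allowed captures
        self.suppressed = 0  # counted so trigger review can see what was dropped

    def allow(self, now: float) -> bool:
        # Forget captures that have aged out of the window.
        while self._granted and now - self._granted[0] >= self.period_seconds:
            self._granted.popleft()
        if len(self._granted) < self.max_captures:
            self._granted.append(now)
            return True
        self.suppressed += 1
        return False


budget = CaptureBudget(max_captures=2, period_seconds=60.0)
decisions = [budget.allow(t) for t in (0.0, 10.0, 20.0, 59.9, 60.0)]
print(decisions)          # [True, True, False, False, True]
print(budget.suppressed)  # 2
```

Counting suppressed triggers matters as much as enforcing the limit. A steadily rising count suggests the trigger is too broad and should be revisited in trigger review, so episodes are not dropped silently.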

Neighbor Distinctions

Intermittent Sampling detects intermittent states by choosing observation windows or triggers. Intermittent Failure Capture preserves richer state and context during actual failure episodes so diagnosis remains possible afterward.

Observability Instrumentation broadly exposes system state. Intermittent Failure Capture is narrower: it is episodic, evidence-preserving, and diagnosis-centered.

Perturbation Testing tries to induce or reproduce behavior under controlled change. Intermittent Failure Capture waits for naturally occurring episodes and captures evidence when they happen.

Intermittent Burst Absorption protects operations from irregular spikes. Intermittent Failure Capture diagnoses elusive failures; it may not absorb or mitigate the episode at all.

Recurrence Pattern Detection can analyze captured records, but it is usually a component or follow-on analysis rather than the capture archetype itself.

Variants and Near Names

Recognized variants include Trigger-Based Diagnostic Capture, Episodic Symptom Capture, Flight-Recorder Failure Capture, and Rare Event Evidence Capture. Near names include Episodic Failure Capture, Rare Event Capture, Intermittent Bug Capture, Triggered Diagnostic Capture, Incident Snapshot, Symptom Diary, and Flight-Recorder Logging.

Simple event loggers, debug traces, incident snapshots, flight recorders, and diaries should be classified as mechanisms rather than as instances of the archetype unless they include the full trigger, capture-window, preservation, recurrence-analysis, and follow-up logic.

Cross-Domain Examples

In software operations, a rare production bug can trigger expanded traces and state snapshots because ordinary logs are too thin after recovery. In healthcare, a symptom diary can preserve timing, context, and recovery details for episodes absent during appointments. In manufacturing, a defect detector can freeze sensor state, material batch metadata, operator actions, and environmental readings when an intermittent flaw appears.

In safety engineering, black-box logs preserve pre-incident state for post-event investigation. In organizational reliability, teams can record workload, handoff state, decision context, and dependency signals whenever an episodic escalation failure occurs.

Non-Examples

A daily dashboard of aggregate error rates is not this archetype unless it preserves episode-level failure evidence. A generic bug report saying “it happened again” is not enough because it lacks state and context. Random spot checks for compliance are usually Intermittent Sampling. A burst queue that holds demand during a spike is Intermittent Burst Absorption, not failure capture.