Incident Replay

Comparing traditional incident analysis with deterministic replay

Note: This page presents a simplified scenario to illustrate how deterministic replay can support incident analysis. Timelines shown are indicative and depend on system complexity, integration context, and organisational factors. Deterministic replay improves reproducibility for verification purposes but does not guarantee correctness or eliminate all operational risk.

The Debugging Challenge in Safety-Critical Systems

When an autonomous vehicle experiences a near-miss, when a medical device exhibits unexpected behavior, when an aircraft control system enters an unanticipated state—the clock starts ticking. Production is halted. Engineers are mobilized. Regulators demand answers. And the organization faces a fundamental question: What happened?

In traditional systems, answering this question can be challenging. Logs may be incomplete. Behavior may be difficult to reproduce. Engineers can spend days—sometimes weeks—attempting to reconstruct events from fragmentary evidence. The process can be expensive, time-consuming, and sometimes inconclusive.

Deterministic platforms can significantly reduce this uncertainty. Execution traces can be recorded with cryptographic verification, and replay is designed to be bit-identical under defined conditions. Root cause analysis that takes days with traditional approaches can potentially be shortened substantially.

Why Traditional Debugging Can Take Days

The Non-Determinism Challenge

Traditional real-time operating systems are typically non-deterministic at the level of execution ordering, even when they bound worst-case latency. Thread scheduling depends on interrupt timing. Memory allocation depends on previous allocations. Network I/O depends on packet arrival order. The same code, given the same inputs, can produce different execution paths on different runs.
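One small, concrete way to see this effect: floating-point addition is not associative, so a fusion step that accumulates readings in arrival order can return different results when the same packets arrive in a different order. The sketch below is illustrative only; the function name `fuse_in_arrival_order` and the readings are hypothetical.

```python
def fuse_in_arrival_order(readings):
    """Accumulate sensor readings in the order they arrived (hypothetical fusion step)."""
    total = 0.0
    for value in readings:
        total += value
    return total


# The same three readings, delivered in two different arrival orders.
run_a = fuse_in_arrival_order([0.1, 0.2, 0.3])
run_b = fuse_in_arrival_order([0.3, 0.2, 0.1])

print(run_a == run_b)  # False: the two runs differ in the last bit
```

A difference this small is usually harmless on its own, but once it feeds a threshold comparison or a state machine it can send two runs down different execution paths.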

This non-determinism often makes incident reconstruction difficult:

Incomplete logs: Comprehensive logging can impact real-time performance, so engineers log selectively—and critical details may be missing
Difficult reproduction: Attempting to reproduce the incident in a test environment may fail because execution paths differ
Heisenbugs: Adding instrumentation changes timing, potentially preventing the bug from manifesting
Inference-based analysis: Engineers hypothesize about what might have happened based on incomplete evidence

Illustrative Traditional Debugging Timeline

For a typical safety-critical incident (an autonomous vehicle sensor fusion issue, as shown in the demo):

Traditional Approach: Illustrative timeline (32-72 hours)

Step 1: Review scattered logs (4-8 hours)

Logs are distributed across multiple systems. Timestamps may be approximate. Sensor data is sampled, not continuous. Engineers piece together a timeline from fragmentary evidence.

Step 2: Attempt reproduction (8-16 hours)

The test environment may not reproduce the failure because timing and execution paths differ. Engineers try repeatedly with different configurations, hoping to trigger the bug.

Step 3: Code review and hypothesis generation (4-8 hours)

Without reproduction, engineers read code looking for plausible failure modes. Multiple hypotheses emerge. Each requires investigation.

Step 4: Targeted instrumentation (16-40 hours)

Engineers add detailed logging around suspected code paths, then deploy to the field or wait for a recurrence before analyzing the new logs. The results may show that the hypothesis was incorrect, requiring another iteration.

Illustrative result: Days or weeks of investigation, potentially inconclusive
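The 32-72 hour range quoted for the traditional approach is the sum of the per-step estimates. A short sketch of that arithmetic, assuming a single pass through all four steps (a repeated instrumentation cycle would add to the total):

```python
# Per-step estimates from the illustrative timeline, as (low, high) hours.
steps_hours = {
    "Review scattered logs": (4, 8),
    "Attempt reproduction": (8, 16),
    "Code review and hypothesis generation": (4, 8),
    "Targeted instrumentation": (16, 40),
}

low = sum(lo for lo, _ in steps_hours.values())
high = sum(hi for _, hi in steps_hours.values())

print(f"{low}-{high} hours")  # 32-72 hours
```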

How Deterministic Replay Can Improve the Process

Deterministic platforms such as MDCP can record execution state at every tick. The recording can be cryptographically verifiable, and replaying it is designed to reproduce the original execution bit-identically. This can transform incident analysis:
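To make the idea of a verifiable per-tick recording concrete, here is a minimal sketch of a hash-chained trace. It is not MDCP's format or API, which this page does not describe; the names `record_tick`, `record_trace`, and `verify_trace`, the JSON encoding, and the use of SHA-256 are all assumptions made for illustration. A hash chain makes tampering evident; a real system would typically also sign the final digest.

```python
import hashlib
import json

GENESIS = "0" * 64  # chain anchor used before the first tick


def record_tick(prev_digest, tick, state):
    """Build a trace entry whose digest covers the tick, its state, and the previous digest."""
    payload = json.dumps({"tick": tick, "state": state}, sort_keys=True)
    digest = hashlib.sha256((prev_digest + payload).encode("utf-8")).hexdigest()
    return {"tick": tick, "state": state, "digest": digest}


def record_trace(states):
    """Record one chained entry per tick."""
    trace, prev = [], GENESIS
    for tick, state in enumerate(states):
        entry = record_tick(prev, tick, state)
        trace.append(entry)
        prev = entry["digest"]
    return trace


def verify_trace(trace):
    """Recompute the chain and report whether every stored digest still matches."""
    prev = GENESIS
    for entry in trace:
        expected = record_tick(prev, entry["tick"], entry["state"])["digest"]
        if expected != entry["digest"]:
            return False
        prev = entry["digest"]
    return True


trace = record_trace([{"speed": 12.0}, {"speed": 12.5}, {"speed": 13.1}])
print(verify_trace(trace))  # True

trace[1]["state"]["speed"] = 99.0  # simulate tampering with a recorded tick
print(verify_trace(trace))  # False
```

Because each digest includes the previous one, changing, removing, or reordering any tick invalidates every digest from that point onward.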

Deterministic Approach: Illustrative timeline (~8 minutes)

Step 1: Load execution trace (<1 minute)

The execution trace is automatically captured and cryptographically signed. The engineer loads the trace file, typically 10-50 MB for several minutes of real-time execution.
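A loader would normally check the signature before any replay begins. The sketch below uses an HMAC from the Python standard library as a stand-in; the actual signing scheme is not specified here and may well be an asymmetric signature instead. The function names, key, and trace bytes are hypothetical.

```python
import hashlib
import hmac


def sign_trace(trace_bytes, key):
    """Compute an HMAC-SHA256 tag over the raw trace bytes (stand-in for a real signature)."""
    return hmac.new(key, trace_bytes, hashlib.sha256).hexdigest()


def load_trace(trace_bytes, signature, key):
    """Return the trace only if its tag matches; otherwise refuse to replay it."""
    expected = sign_trace(trace_bytes, key)
    if not hmac.compare_digest(expected, signature):
        raise ValueError("trace signature mismatch; refusing to replay")
    return trace_bytes


key = b"demo-key-not-for-production"  # hypothetical key, for illustration only
blob = b"tick=0;speed=12.0\ntick=1;speed=12.5\n"
sig = sign_trace(blob, key)

loaded = load_trace(blob, sig, key)
print(loaded == blob)  # True
```

`hmac.compare_digest` is used rather than `==` so that the comparison takes constant time regardless of where the two values differ.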

Step 2: Replay execution (<1 minute)

Deterministic replay produces reproducible execution: state transitions, sensor reads, and control outputs replay as they occurred in production. This reduces reliance on inference and reconstruction.
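The property being relied on is that a step function with no hidden inputs, fed the same recorded inputs, yields the same sequence of states every time. A minimal sketch of checking that, assuming a hypothetical `step` function (a simple filter) and a digest over the bit patterns of each state:

```python
import hashlib
import struct


def step(state, sensor_input):
    """Hypothetical control step: blend the previous state with the new reading."""
    return 0.5 * state + 0.5 * sensor_input


def run(inputs, initial_state=0.0):
    """Execute every tick and hash the exact bytes of each resulting state."""
    state = initial_state
    digest = hashlib.sha256()
    for value in inputs:
        state = step(state, value)
        digest.update(struct.pack("<d", state))  # 8-byte little-endian double
    return state, digest.hexdigest()


recorded_inputs = [1.0, 2.0, 4.0, 8.0]
first = run(recorded_inputs)
second = run(recorded_inputs)

print(first == second)  # True: same final state and same digest on both runs
```

Comparing digests of the packed bytes, rather than comparing floats with a tolerance, is what makes this a bit-identical check.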

Step 3: Inspect state at failure point (5-10 minutes)

The debugger stops at the tick where the issue occurred, giving visibility into variables, core states, and system state. The engineer examines sensor values, timing relationships, and state machine positions.
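Stopping at the failing tick can be pictured as replaying recorded inputs until a safety invariant first fails, then handing back the tick number and state for inspection. This is a sketch under assumptions: `buggy_step`, the 15.0 threshold, and the recorded readings are hypothetical and are not taken from the demo.

```python
def buggy_step(gap_estimate, reading):
    """Hypothetical fusion step with a bug: a dropped reading (None) is treated as 0.0."""
    value = 0.0 if reading is None else reading
    return 0.5 * gap_estimate + 0.5 * value


def replay_until_violation(inputs, step, invariant, initial_state):
    """Replay tick by tick and return (tick, state) at the first invariant failure."""
    state = initial_state
    for tick, reading in enumerate(inputs):
        state = step(state, reading)
        if not invariant(state):
            return tick, state
    return None  # the invariant held for the whole trace


recorded = [20.0, 20.0, None, 20.0]  # tick 2 has a dropped reading
result = replay_until_violation(recorded, buggy_step, lambda gap: gap >= 15.0, 20.0)

print(result)  # (2, 10.0)
```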

Step 4: Validate fix (<1 minute)

Apply the proposed fix and replay the incident with the fix in place. This provides strong evidence that the fix addresses the reproduced behaviour for this specific scenario.
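Validating a fix this way amounts to replaying the same recorded inputs through the old and new logic and comparing which ticks violate the invariant. In the sketch below, `buggy_step`, `fixed_step`, the threshold, and the readings are hypothetical examples.

```python
def buggy_step(gap, reading):
    """Hypothetical original logic: a dropped reading (None) is treated as 0.0."""
    value = 0.0 if reading is None else reading
    return 0.5 * gap + 0.5 * value


def fixed_step(gap, reading):
    """Hypothetical fix: hold the previous estimate when a reading is dropped."""
    if reading is None:
        return gap
    return 0.5 * gap + 0.5 * reading


def violations(inputs, step, invariant, initial_state):
    """Replay the recorded inputs and list every tick where the invariant fails."""
    state, bad_ticks = initial_state, []
    for tick, reading in enumerate(inputs):
        state = step(state, reading)
        if not invariant(state):
            bad_ticks.append(tick)
    return bad_ticks


def is_safe(gap):
    return gap >= 15.0


recorded = [20.0, 20.0, None, 20.0]

print(violations(recorded, buggy_step, is_safe, 20.0))  # [2]
print(violations(recorded, fixed_step, is_safe, 20.0))  # []
```

An empty list shows only that the fix holds for this recorded scenario; other input sequences still need their own tests.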

Illustrative result: Minutes from incident to identified root cause and validated fix

Potential Business Impact

Regulatory Engagement

When you can provide detailed evidence of what happened (a cryptographically verified replay rather than a hypothesis or a best guess), this can improve the clarity of incident reports. The FAA, FDA, and automotive safety authorities tend to value reproducible execution evidence. Clear incident documentation may support more efficient regulatory engagement.

Remediation Decisions

Field issues that might otherwise require precautionary recalls can potentially be analyzed more quickly. Instead of extended investigation periods, engineers may be able to diagnose root causes faster and make more targeted remediation decisions. This can help organisations respond more precisely to field issues.

Development Velocity

Deterministic replay can also accelerate development. Bugs found during integration testing become reproducible. Test runs produce consistent results. CI/CD pipelines can replay failures automatically. Engineers may spend less time reproducing bugs and more time fixing them.
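In a CI pipeline, this can take the form of a regression corpus: each past failure is kept as recorded inputs plus the expected outputs, and every build replays all of them. A minimal sketch, in which the corpus contents, `fixed_step`, and `failing_traces` are hypothetical:

```python
def fixed_step(gap, reading):
    """Hypothetical step under test: hold the previous estimate on a dropped reading."""
    if reading is None:
        return gap
    return 0.5 * gap + 0.5 * reading


def replay(inputs, step, initial_state):
    """Replay recorded inputs and return the state after every tick."""
    state, history = initial_state, []
    for reading in inputs:
        state = step(state, reading)
        history.append(state)
    return history


# Hypothetical regression corpus: recorded inputs and the states expected on replay.
REGRESSION_TRACES = {
    "dropped-reading": {"inputs": [20.0, None, 20.0], "expected": [20.0, 20.0, 20.0]},
    "closing-gap": {"inputs": [16.0, 12.0], "expected": [18.0, 15.0]},
}


def failing_traces(step, initial_state=20.0):
    """Return the names of traces whose replay no longer matches the expected states."""
    return [
        name
        for name, case in REGRESSION_TRACES.items()
        if replay(case["inputs"], step, initial_state) != case["expected"]
    ]


print(failing_traces(fixed_step))  # []
```

Exact equality is appropriate here only because replay is assumed to be deterministic; with a non-deterministic system the comparison would need tolerances and would be prone to flaky failures.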

Certification Evidence

Safety certification under standards such as ISO 26262 (which defines ASILs), DO-178C, and IEC 62304 requires demonstrating correct behavior under specified conditions. Deterministic replay can strengthen evidence quality by providing reproducible verification artefacts. This does not replace certification activities but can support more efficient evidence generation.

Customer Communication

When incidents occur—and they will—customers want answers. "We're investigating" is less satisfying than being able to provide detailed analysis. Having cryptographically verified replay showing what happened, the identified root cause, and evidence that a fix addresses the behaviour can support clearer customer communication.

Deployment Scenarios

Autonomous Vehicles

Autonomous vehicles are distributed sensor fusion systems with many inputs, multiple processors, and safety-critical outputs. When a near-miss occurs, deterministic replay can support:

Reconstruction of sensor timing relationships
Analysis of perception model decisions
Investigation of planning algorithm edge cases
Evidence of control system behavior under specific conditions

Medical Devices

Implanted cardiac devices, insulin pumps, ventilators—all must respond correctly to physiological events. When unexpected behavior occurs, replay can support:

Reconstruction of patient sensor data and device state
Analysis of therapy delivery timing
Investigation of edge cases in physiological models
Evidence generation for regulatory submissions

Aerospace Systems

Flight control, avionics, engine management—aerospace systems operate in environments where reliability is critical. Replay can support:

Post-flight incident analysis with improved state visibility
Evidence generation for DO-178C compliance
Analysis of redundancy and voting algorithms
Testing of emergency procedures under specific conditions

Explore Deterministic Replay

The demonstration above shows a simplified incident scenario for illustrative purposes. Real deployments handle more complex state spaces. Whether you're developing autonomous systems, medical devices, or aerospace platforms, deterministic replay can support your approach to incident analysis and verification.
