Incident Replay
Comparing traditional incident analysis with deterministic replay
Note: This page presents a simplified scenario to illustrate how deterministic replay can support incident analysis. Timelines shown are indicative and depend on system complexity, integration context, and organisational factors. Deterministic replay improves reproducibility for verification purposes but does not guarantee correctness or eliminate all operational risk.
The Debugging Challenge in Safety-Critical Systems
When an autonomous vehicle experiences a near-miss, when a medical device exhibits unexpected behavior, when an aircraft control system enters an unanticipated state—the clock starts ticking. Production is halted. Engineers are mobilized. Regulators demand answers. And the organization faces a fundamental question: What happened?
In traditional systems, answering this question can be challenging. Logs may be incomplete. Behavior may be difficult to reproduce. Engineers can spend days—sometimes weeks—attempting to reconstruct events from fragmentary evidence. The process can be expensive, time-consuming, and sometimes inconclusive.
Deterministic platforms can significantly reduce this uncertainty. Execution traces can be recorded with cryptographic verification, and replay is designed to be bit-identical under defined conditions. Root cause analysis that might take days with traditional approaches can potentially be shortened substantially.
Why Traditional Debugging Can Take Days
The Non-Determinism Challenge
Traditional real-time operating systems are typically non-deterministic at the level of execution order. Thread scheduling depends on interrupt timing. Memory allocation behavior depends on the history of previous allocations. Network I/O depends on packet arrival order. The same code, given the same inputs, can follow different execution paths on different runs.
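To make the last point concrete, here is a minimal sketch in Python. It does not use real threads: it enumerates every possible interleaving of two hypothetical tasks, each performing a non-atomic increment (read, then write back) on a shared counter. The task names and the two-step model are assumptions chosen for illustration only.

```python
from itertools import combinations

def interleavings(n_a: int, n_b: int):
    """Yield every schedule of two tasks, preserving each task's own step order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        yield tuple("A" if i in a_slots else "B" for i in range(total))

def run(schedule) -> int:
    """Each task does a two-step, non-atomic increment of a shared counter."""
    shared = 0
    local = {"A": None, "B": None}
    for task in schedule:
        if local[task] is None:
            local[task] = shared          # step 1: read the shared value
        else:
            shared = local[task] + 1      # step 2: write back value + 1
    return shared

# Group the schedules by the final counter value they produce.
outcomes = {}
for schedule in interleavings(2, 2):
    outcomes.setdefault(run(schedule), []).append("".join(schedule))

for result, schedules in sorted(outcomes.items()):
    print(result, schedules)
```

Of the six possible schedules, only the two that run the tasks back to back end with the counter at 2; the other four lose an update and end at 1. Which schedule occurs on a real system depends on timing the code does not control.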
This non-determinism often makes incident reconstruction difficult:
- Incomplete logs: Comprehensive logging can impact real-time performance, so engineers log selectively—and critical details may be missing
- Difficult reproduction: Attempting to reproduce the incident in a test environment may fail because execution paths differ
- Heisenbugs: Adding instrumentation changes timing, potentially preventing the bug from manifesting
- Inference-based analysis: Engineers hypothesize about what might have happened based on incomplete evidence
Illustrative Traditional Debugging Timeline
For a typical safety-critical incident (autonomous vehicle sensor fusion issue, as shown in the demo):
Traditional Approach: Illustrative timeline (32-72 hours)
Engineers piece together a timeline from fragmentary evidence: logs are distributed across multiple systems, timestamps may be approximate, and sensor data is sampled rather than continuous.
The test environment may not reproduce the failure because timing and execution paths differ. Engineers try repeatedly with different configurations, hoping to trigger the bug.
Without a reproduction, engineers read code looking for plausible failure modes. Multiple hypotheses emerge, and each requires investigation.
Engineers add detailed logging around suspected code paths, deploy to the field or wait for recurrence, and analyze the new logs. The results may reveal that the hypothesis was incorrect, requiring another iteration.
Illustrative result: Days or weeks of investigation, potentially inconclusive
How Deterministic Replay Can Improve the Process
Deterministic platforms such as MDCP can record execution state at every tick. The recording can be made cryptographically verifiable, and replay is designed to be bit-identical. This can transform incident analysis:
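One common way to make a per-tick recording tamper-evident is a hash chain, where each tick's digest covers the previous digest. The Python sketch below illustrates that general technique only; it is not MDCP's trace format, and the function names, the JSON encoding, and the sample state fields are all assumptions.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain

def tick_digest(prev_digest: str, tick: int, state: dict) -> str:
    """Hash this tick's state together with the previous tick's digest."""
    payload = json.dumps({"tick": tick, "state": state},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + payload).encode("utf-8")).hexdigest()

def record(states):
    """Build a trace in which every entry commits to all earlier entries."""
    trace, prev = [], GENESIS
    for tick, state in enumerate(states):
        prev = tick_digest(prev, tick, state)
        trace.append({"tick": tick, "state": state, "digest": prev})
    return trace

def first_bad_tick(trace):
    """Return the first tick whose digest does not match, or None if intact."""
    prev = GENESIS
    for entry in trace:
        if tick_digest(prev, entry["tick"], entry["state"]) != entry["digest"]:
            return entry["tick"]
        prev = entry["digest"]
    return None

# Hypothetical per-tick states (integer millimetres, boolean brake flag).
states = [
    {"range_mm": 12000, "brake": False},
    {"range_mm": 9000, "brake": False},
    {"range_mm": 4000, "brake": True},
]
trace = record(states)
print(first_bad_tick(trace))      # intact trace

tampered = copy.deepcopy(trace)
tampered[1]["state"]["range_mm"] = 11000
print(first_bad_tick(tampered))   # altered after recording
```

The intact trace verifies cleanly, while the altered copy is rejected at tick 1, the first entry whose stored digest no longer matches its contents.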
Deterministic Approach: Illustrative timeline (~8 minutes)
The execution trace is automatically captured and cryptographically signed. The engineer loads the trace file, typically 10-50 MB for several minutes of real-time execution.
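A loader would normally check the signature before any replay begins. The sketch below shows that check-then-load pattern in Python. As a loudly stated assumption, it substitutes an HMAC-SHA256 tag from the standard library for a real asymmetric signature, and the key, function names, and sample bytes are hypothetical.

```python
import hashlib
import hmac

def sign_trace(trace_bytes: bytes, key: bytes) -> str:
    """Compute an authentication tag (HMAC stand-in for a real signature)."""
    return hmac.new(key, trace_bytes, hashlib.sha256).hexdigest()

def load_verified(trace_bytes: bytes, tag: str, key: bytes) -> bytes:
    """Return the trace only if its tag matches; otherwise refuse to load."""
    expected = sign_trace(trace_bytes, key)
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, tag):
        raise ValueError("trace authentication failed; refusing to replay")
    return trace_bytes

key = b"demo-key-not-for-production"            # hypothetical key
trace_bytes = b'{"tick":0,"range_mm":12000}\n'   # hypothetical trace content
tag = sign_trace(trace_bytes, key)

loaded = load_verified(trace_bytes, tag, key)
print(len(loaded), "bytes accepted")
```

Passing a trace that differs by even one byte from the signed content raises `ValueError`, so a corrupted or modified file is rejected before analysis starts.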
Deterministic replay produces reproducible execution. State transitions, sensor reads, and control outputs replay as they occurred in production, which reduces reliance on inference and reconstruction.
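The core idea can be sketched as a pure tick function driven by recorded inputs: if the next state depends only on the current state and the recorded input, re-running the recording yields the same state sequence. In the Python sketch below, the filter, field names, and values are hypothetical, and integer arithmetic is used so results do not depend on floating-point settings.

```python
import hashlib
import json

def step(state: dict, inputs: dict) -> dict:
    """Hypothetical pure tick function: output depends only on (state, inputs)."""
    filtered = (3 * state["filtered_mm"] + inputs["range_mm"]) // 4
    return {"filtered_mm": filtered, "brake": filtered < 5000}

def run(initial: dict, recorded_inputs: list) -> list:
    """Apply the tick function to each recorded input, collecting every state."""
    states, state = [], dict(initial)
    for inputs in recorded_inputs:
        state = step(state, inputs)
        states.append(state)
    return states

def digest(states: list) -> str:
    """Hash the full state sequence so two runs can be compared cheaply."""
    blob = json.dumps(states, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

initial = {"filtered_mm": 12000, "brake": False}
recorded_inputs = [{"range_mm": r} for r in (12000, 8000, 4000, 2000)]

production = run(initial, recorded_inputs)   # what ran in the field
replayed = run(initial, recorded_inputs)     # what the engineer replays

print(digest(production) == digest(replayed))
```

Comparing digests rather than eyeballing logs gives a yes-or-no answer to whether the replay matches the recording. A real platform also has to capture every source of input, including timing, for this property to hold.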
The debugger stops at the tick where the issue occurred, giving visibility into variables, core states, and system state. The engineer examines sensor values, timing relationships, and state machine positions.
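Stopping at the right tick can be thought of as replaying until an invariant first fails. The Python sketch below assumes a hypothetical sensor fusion step and a hypothetical rule that radar samples older than 50 ms must not be used; the field names and sample data are invented for illustration.

```python
MAX_RADAR_AGE_MS = 50  # assumed freshness limit for radar samples

def step(state: dict, inputs: dict) -> dict:
    """Hypothetical fusion step: average lidar and radar ranges (integer mm)."""
    return {
        "fused_mm": (inputs["lidar_mm"] + inputs["radar_mm"]) // 2,
        "radar_age_ms": inputs["t_ms"] - inputs["radar_t_ms"],
    }

def invariant(state: dict) -> bool:
    """The fused value must not be based on a stale radar sample."""
    return state["radar_age_ms"] <= MAX_RADAR_AGE_MS

def replay_until_violation(initial, recorded_inputs, step_fn, invariant_fn):
    """Replay tick by tick; return (tick, inputs, state) at the first violation."""
    state = initial
    for tick, inputs in enumerate(recorded_inputs):
        state = step_fn(state, inputs)
        if not invariant_fn(state):
            return tick, inputs, state
    return None

# Hypothetical recording: the radar stops updating after t = 20 ms.
recorded_inputs = [
    {"t_ms": 0, "lidar_mm": 12000, "radar_mm": 12100, "radar_t_ms": 0},
    {"t_ms": 20, "lidar_mm": 11800, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 40, "lidar_mm": 11600, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 60, "lidar_mm": 11400, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 80, "lidar_mm": 11200, "radar_mm": 11900, "radar_t_ms": 20},
]

hit = replay_until_violation({}, recorded_inputs, step, invariant)
print(hit)
```

Here the replay halts at tick 4, where the radar sample is 60 ms old, and returns the exact inputs and state at that tick for inspection.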
The engineer applies the proposed fix and replays the incident with the fix in place. This provides strong evidence that the fix addresses the reproduced behaviour for this specific scenario.
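Validating a fix by replay amounts to running the original and the patched logic against the same recorded inputs and comparing where each one breaks the rule. The Python sketch below assumes a hypothetical fusion step that wrongly uses stale radar data, and a hypothetical fix that falls back to lidar alone; all names and values are illustrative.

```python
MAX_RADAR_AGE_MS = 50  # assumed freshness limit for radar samples

def radar_age(inputs: dict) -> int:
    return inputs["t_ms"] - inputs["radar_t_ms"]

def step_original(inputs: dict) -> dict:
    """Hypothetical faulty logic: always fuses radar, however old the sample."""
    fused = (inputs["lidar_mm"] + inputs["radar_mm"]) // 2
    return {"fused_mm": fused, "used_radar": True}

def step_fixed(inputs: dict) -> dict:
    """Hypothetical fix: ignore radar when its sample is too old."""
    if radar_age(inputs) > MAX_RADAR_AGE_MS:
        return {"fused_mm": inputs["lidar_mm"], "used_radar": False}
    return step_original(inputs)

def violations(recorded_inputs: list, step_fn) -> list:
    """Ticks at which a stale radar sample contributed to the output."""
    bad = []
    for tick, inputs in enumerate(recorded_inputs):
        out = step_fn(inputs)
        if out["used_radar"] and radar_age(inputs) > MAX_RADAR_AGE_MS:
            bad.append(tick)
    return bad

# Same hypothetical recording for both runs: radar stops updating at t = 20 ms.
recorded_inputs = [
    {"t_ms": 0, "lidar_mm": 12000, "radar_mm": 12100, "radar_t_ms": 0},
    {"t_ms": 20, "lidar_mm": 11800, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 40, "lidar_mm": 11600, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 60, "lidar_mm": 11400, "radar_mm": 11900, "radar_t_ms": 20},
    {"t_ms": 80, "lidar_mm": 11200, "radar_mm": 11900, "radar_t_ms": 20},
]

before = violations(recorded_inputs, step_original)
after = violations(recorded_inputs, step_fixed)
print("original:", before, "fixed:", after)
```

The original logic violates the rule at tick 4 and the patched logic does not. That result is evidence about this one recording only; it says nothing about inputs the trace never exercised.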
Illustrative result: Minutes from incident to identified root cause and validated fix
Potential Business Impact
Regulatory Engagement
Providing detailed evidence of what happened (a cryptographically verified replay rather than a hypothesis or best guess) can improve the clarity of incident reports. Regulators such as the FAA, the FDA, and automotive safety authorities are likely to value reproducible execution evidence. Clear incident documentation may support more efficient regulatory engagement.
Remediation Decisions
Field issues that might otherwise require precautionary recalls can potentially be analyzed more quickly. Instead of extended investigation periods, engineers may be able to diagnose root causes faster and make more targeted remediation decisions. This can help organisations respond more precisely to field issues.
Development Velocity
Deterministic replay can also accelerate development. Bugs found during integration testing become reproducible. Test runs produce consistent results. CI/CD pipelines can replay failures automatically. Engineers may spend less time reproducing bugs and more time fixing them.
Certification Evidence
Safety certification under standards such as ISO 26262 (which defines the ASIL levels), DO-178C, and IEC 62304 requires demonstrating correct behavior under specified conditions. Deterministic replay can strengthen evidence quality by providing reproducible verification artefacts. This does not replace certification activities but can support more efficient evidence generation.
Customer Communication
When incidents occur—and they will—customers want answers. "We're investigating" is less satisfying than being able to provide detailed analysis. Having cryptographically verified replay showing what happened, the identified root cause, and evidence that a fix addresses the behaviour can support clearer customer communication.
Deployment Scenarios
Autonomous Vehicles
Autonomous vehicles are distributed sensor fusion systems with many inputs, multiple processors, and safety-critical outputs. When a near-miss occurs, deterministic replay can support:
- Reconstruction of sensor timing relationships
- Analysis of perception model decisions
- Investigation of planning algorithm edge cases
- Evidence of control system behavior under specific conditions
Medical Devices
Implanted cardiac devices, insulin pumps, ventilators—all must respond correctly to physiological events. When unexpected behavior occurs, replay can support:
- Reconstruction of patient sensor data and device state
- Analysis of therapy delivery timing
- Investigation of edge cases in physiological models
- Evidence generation for regulatory submissions
Aerospace Systems
Flight control, avionics, engine management—aerospace systems operate in environments where reliability is critical. Replay can support:
- Post-flight incident analysis with improved state visibility
- Evidence generation for DO-178C compliance
- Analysis of redundancy and voting algorithms
- Testing of emergency procedures under specific conditions
Explore Deterministic Replay
The demonstration above shows a simplified incident scenario for illustrative purposes. Real deployments handle more complex state spaces. Whether you're developing autonomous systems, medical devices, or aerospace platforms, deterministic replay can support how you approach incident analysis and verification.