# SpeyTech — Full Documentation > Deterministic computing systems for safety-critical environments. This document contains the complete text of key SpeyTech pages and articles for LLM consumption. For a summary version, see: https://speytech.com/llms.txt --- # Company Overview SpeyTech develops deterministic software platforms for aerospace, medical devices, and autonomous systems. Founded by William Murray, a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, based in the Scottish Highlands. **Founder**: William Murray **Location**: Inverness, Scottish Highlands, UK **Academic affiliation**: Visiting Scholar, Heriot-Watt University **Experience**: 30 years UNIX systems engineering --- # Core Concepts ## Why Deterministic Computing Deterministic computing ensures that given the same inputs, a system will always produce the same outputs, in the same order, with the same timing. This property is fundamental to safety-critical systems. ### The Problem with Non-Determinism Conventional software exhibits non-deterministic behaviour due to: - **Race conditions** — Thread scheduling varies between runs - **Floating-point variance** — Results differ across platforms and compilers - **Memory allocation** — Dynamic allocation introduces unpredictable timing - **External dependencies** — Network, disk, and OS interactions vary In safety-critical domains (aerospace, medical devices, autonomous vehicles), non-determinism creates: - **Certification challenges** — Cannot prove behaviour without reproducibility - **Debugging complexity** — Heisenbugs that disappear under observation - **Incident reconstruction** — Cannot replay failures accurately - **Liability exposure** — Cannot demonstrate due diligence ### The Deterministic Alternative Deterministic systems provide: - **Bit-identical execution** — Same results across platforms, compilers, runs - **Provable timing** — Worst-case execution time (WCET) guarantees - **Complete reproducibility** — Any execution can be replayed exactly - **Simplified certification** — Demonstrate compliance through replay ### When Determinism Matters Deterministic computing is essential for: - DO-178C Level A aerospace systems - IEC 62304 Class C medical devices - ISO 26262 ASIL-D automotive systems - Any system where failure has safety implications --- # Products ## MDCP — Murray Deterministic Computing Platform MDCP is a tick-based deterministic execution substrate for safety-critical systems. It provides mathematical guarantees of reproducibility through architectural constraints rather than testing. 
**Patent**: GB2521625.0 ### Core Architecture MDCP replaces conventional scheduling with tick-based execution: - **Tick-based scheduling** — All operations occur at defined tick boundaries - **Deterministic memory** — Static allocation with bounded access patterns - **Lock-free coordination** — No mutexes, no deadlocks, no priority inversion - **Cryptographic sealing** — Every state transition is hash-chained ### Technical Properties | Property | Guarantee | |----------|-----------| | Execution order | Identical across runs | | Timing | Bounded worst-case (WCET) | | Memory | O(1) static allocation | | State transitions | Hash-chained audit trail | ### Kernel Components MDCP comprises 14 kernel components: - Core scheduling and tick management - Deterministic memory allocation - Inter-process communication - Cryptographic state binding - Health monitoring and fault detection ### Certification Alignment MDCP is designed to support certification under: - DO-178C (aerospace) — Levels A through D - IEC 62304 (medical devices) — Classes A, B, C - ISO 26262 (automotive) — ASIL A through D - IEC 61508 (industrial) — SIL 1 through 4 --- ## MDLCE — Murray Deterministic Liability Closure Engine MDLCE provides cryptographic execution binding for compliance attestation and post-incident analysis. It creates unforgeable proof of what code executed, when, and with what results. **Patent**: GB2522369.4 ### Purpose When incidents occur in safety-critical systems, investigators need to answer: - What code was running? - What were the inputs? - What decisions did the system make? - Can we reproduce the failure? MDLCE provides cryptographic answers to all four questions. ### Technical Approach - **Execution binding** — Hash-chain links code version to execution trace - **Input attestation** — All inputs cryptographically timestamped - **Decision logging** — State machine transitions recorded with proofs - **Replay capability** — Byte-identical reconstruction of any execution ### Liability Protection MDLCE supports: - **Regulatory compliance** — Demonstrate adherence to standards - **Incident investigation** — Prove what happened, not what might have happened - **Product liability defence** — Unforgeable evidence of correct operation - **Insurance requirements** — Quantifiable risk through provable behaviour --- ## CardioCore — Deterministic Medical Device Kernel CardioCore is a deterministic kernel reference implementation for implantable cardiac devices. It demonstrates how MDCP principles apply to IEC 62304 Class C medical device software. 
### Safety Philosophy

CardioCore implements the VITA-SB (Verified Implantable Technology Architecture — Safety Bounded) philosophy:

- **Fail-safe defaults** — Every state has a safe fallback
- **Bounded execution** — All operations complete within defined time
- **Triple redundancy** — Critical calculations verified independently
- **Deterministic pacing** — Timing guarantees for life-critical functions

### Technical Properties

| Property | Implementation |
|----------|----------------|
| Execution model | Tick-based, deterministic |
| Memory model | Static allocation only |
| Timing model | WCET-proven operations |
| Fault handling | TMR with voting |

### Regulatory Alignment

CardioCore is designed to align with:

- **IEC 62304** — Software lifecycle for medical devices
- **IEC 60601** — Medical electrical equipment safety
- **FDA guidance** — Software as a Medical Device (SaMD)

### Why It Matters

Implantable cardiac devices face unique challenges:

- **Zero tolerance for failure** — Device failure can be fatal
- **Long operational life** — Devices operate for years without service
- **Limited observability** — Cannot easily debug in situ
- **Regulatory scrutiny** — Extensive pre-market and post-market requirements

Deterministic execution addresses these by making behaviour provable rather than merely tested.

---

# Open Source Projects

All projects available at: https://github.com/SpeyTech

## Fixed-Point Fundamentals

**Repository**: https://github.com/SpeyTech/fixed-point-fundamentals
**License**: MIT
**Documentation**: https://speytech.com/open-source/fixed-point-fundamentals/

Learn fixed-point arithmetic from first principles — because 'close enough' isn't deterministic

Most engineers learn floating-point arithmetic and never question it. IEEE 754 is convenient, widely supported, and "good enough" for most applications. Until it isn't.

When you need **deterministic results** — the same output for the same input, every time, on every platform — floating-point becomes a liability. When certification bodies ask you to prove your arithmetic is bounded, floating-point makes that proof difficult. When accumulated rounding errors cause your control system to drift, floating-point is the culprit.

Fixed-point arithmetic solves these problems. But most engineers never learned it properly.

[View on GitHub](https://github.com/SpeyTech/fixed-point-fundamentals)

## The Problem with Floating-Point

Floating-point arithmetic has three fundamental issues for safety-critical systems:

### 1. Non-Determinism Across Platforms

The same floating-point code can produce different results on different hardware:

```c
// This may give different answers on x87 vs SSE vs ARM
float result = a * b + c * d;
```

The x87 FPU uses 80-bit extended precision internally. SSE uses 64-bit. ARM has its own quirks. Compiler flags like `-ffast-math` change behaviour. The "same" computation isn't the same at all.

### 2. Accumulation Drift

Small rounding errors compound over time:

```c
float sum = 0.0f;
for (int i = 0; i < 1000000; i++) {
    sum += 0.1f;   /* 0.1 is not exactly representable; each addition rounds */
}
/* The accumulated sum drifts measurably away from the exact value of 100000 */
```

## The Core Pattern: Widen, Compute, Narrow

Fixed-point arithmetic replaces all of this with plain integer operations on values that carry an implied binary point. The price is that you manage range yourself. A Q16.16 multiply, for example, must widen to 64 bits before narrowing back to 32:

```c
q16_16_t q16_mul(q16_16_t a, q16_16_t b) {
    int64_t wide = (int64_t)a * (int64_t)b;   /* widen  */
    return (q16_16_t)(wide >> 16);            /* narrow */
}
```

Without widening, the multiplication would overflow silently. With widening, you have room for the full result before narrowing back to your target format.

This pattern — widen, compute, narrow — appears everywhere in fixed-point code. Master it and you've mastered half of fixed-point arithmetic.

## Rounding: Why It Matters More Than You Think

When you narrow a result, you lose precision. How you handle that loss matters:

**Truncation** (round toward zero) introduces systematic bias.
If you truncate repeatedly, errors accumulate in one direction. **Round-half-up** (school rounding) has the same problem — it biases toward positive infinity.

**Round-to-nearest-even** (banker's rounding) is statistically unbiased. When the value is exactly halfway between two representable numbers, it rounds to the nearest even number. Over many operations, the positive and negative roundings cancel out.

```c
// Round-to-nearest-even for Q16.16 division
q16_16_t q16_div_rne(q16_16_t a, q16_16_t b) {
    int64_t wide      = ((int64_t)a << 16);
    int64_t quotient  = wide / b;
    int64_t remainder = wide % b;
    int64_t half      = (b > 0) ? b/2 : -b/2;
    if (remainder > half || (remainder == half && (quotient & 1))) {
        quotient += (b > 0) ? 1 : -1;
    }
    return (q16_16_t)quotient;
}
```

Lesson 04 demonstrates this with a 1-million-operation test. Truncation drifts to zero. Round-half-up drifts positive. RNE stays centred.

## Overflow: The Silent Killer

In C, signed integer overflow is **undefined behaviour**. The compiler is free to assume it never happens, which can lead to surprising optimisations that break your code. Fixed-point code must handle overflow explicitly:

```c
typedef struct {
    q16_16_t value;
    uint8_t  flags;   // Sticky fault flags
} q16_16_result_t;

#define FAULT_OVERFLOW  0x01
#define FAULT_UNDERFLOW 0x02
#define FAULT_SATURATED 0x04

q16_16_result_t q16_add_safe(q16_16_t a, q16_16_t b, uint8_t *flags) {
    int64_t wide = (int64_t)a + (int64_t)b;
    q16_16_result_t result;
    if (wide > INT32_MAX) {
        result.value = INT32_MAX;
        *flags |= FAULT_OVERFLOW | FAULT_SATURATED;
    } else if (wide < INT32_MIN) {
        result.value = INT32_MIN;
        *flags |= FAULT_UNDERFLOW | FAULT_SATURATED;
    } else {
        result.value = (q16_16_t)wide;
    }
    result.flags = *flags;
    return result;
}
```

### Mixing Q Formats: A PID Controller

Real designs mix Q formats: small, high-precision coefficients, a wide accumulator, and a general-purpose interface.

```c
typedef struct {
    q8_24_t  kp, ki, kd;      // Coefficients in Q8.24
    q32_32_t integral;        // Accumulator in Q32.32
    q16_16_t last_error;      // Previous error in Q16.16
} pid_state_t;

q16_16_t pid_update(pid_state_t *pid, q16_16_t error, q16_16_t dt) {
    // Proportional term
    q32_32_t p_term = q_mul_q8_24_q16_16(pid->kp, error);

    // Integral term (accumulate in Q32.32 to prevent overflow)
    pid->integral = q32_add(pid->integral,
                            q_mul_q8_24_q16_16(pid->ki, q_mul(error, dt)));

    // Derivative term
    q16_16_t derivative = q_div(q_sub(error, pid->last_error), dt);
    q32_32_t d_term = q_mul_q8_24_q16_16(pid->kd, derivative);
    pid->last_error = error;

    // Sum and convert to output format
    q32_32_t output = q32_add(q32_add(p_term, pid->integral), d_term);
    return q32_to_q16(output);   // Saturate if needed
}
```

Coefficients use Q8.24 (small values, high precision). The integral accumulator uses Q32.32 (wide range, prevents overflow). Inputs and outputs use Q16.16 (general purpose interface).

### Sine Lookup Table with Linear Interpolation

When you can't afford the cycles for CORDIC or polynomial approximation:

```c
// Quarter-wave table: 256 intervals, 257 entries
static const q16_16_t sine_table[257] = {
    0x00000000, 0x00000648, 0x00000C8F, /* ... */
};

q16_16_t q16_sin(q16_16_t angle) {
    // Reduce to [0, 2π)
    angle = angle & 0x0000FFFF;   // Assuming 2π = 0x10000

    // Determine quadrant and index
    uint32_t quadrant = (angle >> 14) & 0x3;
    uint32_t index    = (angle >> 6) & 0xFF;
    uint32_t frac     = angle & 0x3F;

    // Lookup with linear interpolation
    q16_16_t y0 = sine_table[index];
    q16_16_t y1 = sine_table[index + 1];
    q16_16_t result = y0 + (((y1 - y0) * frac) >> 6);

    // Apply quadrant symmetry
    // ...

    return result;
}
```

A 257-entry table (1KB) gives you better than 16-bit precision with simple linear interpolation. No floating-point transcendentals required.

## The Certification Bridge

This course teaches the **fundamentals** with standalone, copy-paste-friendly code under the MIT license.
For **production safety-critical systems**, the [certifiable-inference](https://github.com/SpeyTech/certifiable-inference) project provides: | This Course | certifiable-* Ecosystem | |:------------|:------------------------| | Teaching implementations | Production implementations | | Standalone examples | Ecosystem integration | | MIT license | GPL + CLA for IP protection | | "Here's how it works" | "Here's proof it works" | The certifiable-* ecosystem adds Merkle audit trails, cross-platform bit-identity verification, and documentation templates aligned with DO-178C, IEC 62304, and ISO 26262. If you're building systems that need certification, that's where you go after learning the fundamentals here. ## Getting Started ```bash git clone https://github.com/SpeyTech/fixed-point-fundamentals.git cd fixed-point-fundamentals make ``` Each lesson is self-contained. Start with Lesson 00 to understand why floating-point fails, or jump to the topic you need. **Prerequisites:** C programming (comfortable with integers and bit operations), basic arithmetic, a C compiler. No external dependencies. ## What You'll Build By the end of this course, you'll be able to: - Implement fixed-point arithmetic in strict C99 - Choose appropriate Q formats for your signal characteristics - Handle overflow, underflow, and rounding correctly - Build production-grade control systems with bounded, deterministic behaviour - Understand the path from teaching implementations to certified production code ## Reference Materials The course includes formal specifications following the same methodology used in aerospace and medical device development: - **FPF-MATH-001** — Mathematical closure architecture - **FPF-STRUCT-001** — Data structure specification Plus quick-reference materials: - **Q Formats** — Common formats and their properties - **Common Pitfalls** — Mistakes to avoid - **Further Reading** — Where to go next ## Related Reading - [Fixed-Point Neural Networks: The Math Behind Q16.16](https://speytech.com/insights/fixed-point-neural-networks/) - [Round-to-Nearest-Even: The Rounding Mode That Makes Determinism Possible](https://speytech.com/insights/round-to-nearest-even/) - [Why Floating Point Is Dangerous](https://speytech.com/ai-architecture/floating-point-danger/) - [Bit-Perfect Reproducibility: Why It Matters and How to Prove It](https://speytech.com/insights/bit-perfect-reproducibility/) --- *Prove first, code second. MIT licensed.* As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. --- ## certifiable-bench **Repository**: https://github.com/SpeyTech/certifiable-bench **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-bench/ Performance benchmarking for deterministic ML — because 'fast' means nothing if you can't prove it's correct Performance benchmarking in safety-critical systems has a fundamental problem: you can't compare performance across platforms unless you first prove the outputs are identical. [View on GitHub](https://github.com/SpeyTech/certifiable-bench) ## The Problem Standard benchmarking tools measure how fast code runs. They don't verify that the code produces the same results on different hardware. For most software, this doesn't matter. For safety-critical ML inference, it matters enormously. Consider deploying a neural network trained on x86 to a RISC-V edge device. The standard approach: benchmark both, compare latency. 
But what if floating-point rounding differences, SIMD variations, or library implementations cause the outputs to differ? You've optimised for speed on a system that's now producing different results.

DO-178C, IEC 62304, and ISO 26262 all require evidence of correct behaviour. A benchmark showing "RISC-V is 2.3x slower than x86" is meaningless if the outputs don't match.

## The Solution: The Bit-Identity Gate

certifiable-bench introduces a simple but critical concept: the bit-identity gate. You can only compare performance after proving the outputs are identical.

```c
cb_compare_results(&x86_result, &riscv_result, &comparison);

if (comparison.outputs_identical) {
    // Outputs match — performance comparison is meaningful
    printf("RISC-V is %.2fx slower\n", comparison.latency_ratio_q16 / 65536.0);
} else {
    // Outputs differ — comparison is invalid
    printf("ERROR: Outputs differ, comparison invalid\n");
}
```

This changes the workflow. Instead of "benchmark first, hope outputs match", it's "verify outputs, then benchmark". The verification uses FIPS 180-4 SHA-256 to hash outputs during the benchmark run (outside the timing loop), then compares hashes across platforms.

## What's Implemented

The harness provides six modules, each with formal requirements documentation:

| Module | SRS | Tests | Purpose |
|--------|-----|-------|---------|
| Timer | SRS-001 | 10,032 | High-resolution timing with 23ns measured overhead |
| Metrics | SRS-002 | 1,502 | Statistics, histograms, outlier detection, WCET estimation |
| Platform | SRS-006 | 35 | CPU detection, hardware counters, environment monitoring |
| Verify | SRS-004 | 113 | SHA-256 hashing, golden references, result binding |
| Runner | SRS-003 | 92 | Warmup phases, critical loop, configurable iterations |
| Report | SRS-005 | 66 | JSON/CSV output, cross-platform comparison |

Total: 11,840 assertions across approximately 233 SHALL statements in the requirements documentation.

## The Critical Loop

The benchmark runner separates timing from verification:

```c
for (i = 0; i < config->measure_iterations; i++) {
    /* === CRITICAL LOOP START === */
    t_start = cb_timer_now_ns();
    rc = fn(ctx, input, output);
    t_end = cb_timer_now_ns();
    /* === CRITICAL LOOP END === */

    samples[i] = t_end - t_start;

    /* Verification OUTSIDE critical timing */
    if (config->verify_outputs) {
        cb_verify_ctx_update(&verify_ctx, output, output_size);
    }
}
```

Verification happens outside the timed region. The final hash covers all outputs from all iterations, ensuring determinism across the entire run.

## Statistics That Matter for Certification

Beyond mean and standard deviation, certifiable-bench computes metrics required for safety certification:

**WCET Estimation**: Worst Case Execution Time bound calculated as `max + 6×stddev`. This provides a conservative estimate for real-time scheduling.

**Percentiles**: p50, p95, p99 for understanding tail latency.

**Outlier Detection**: MAD-based detection (Iglewicz & Hoaglin method) to identify anomalous samples from thermal throttling, interrupts, or other interference.

**Histogram**: Configurable bins with overflow/underflow tracking for visualising the full latency distribution.

All statistics use integer-only arithmetic where possible, avoiding floating-point non-determinism in the measurement infrastructure itself.
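To make the integer-only statistics concrete, here is a minimal sketch of the WCET bound described above (max plus six standard deviations), computed without any floating point. The names `wcet_bound_ns` and `isqrt_u64` are illustrative assumptions, not part of the certifiable-bench API; the real Metrics module is specified in SRS-002.

```c
#include <stdint.h>
#include <stddef.h>

/* Integer square root by binary search; keeps the stddev computation
 * free of floating point. Illustrative, not the certifiable-bench API. */
static uint64_t isqrt_u64(uint64_t x) {
    uint64_t lo = 0, hi = (x < 2) ? x : x / 2 + 1;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;
        if (mid <= x / mid) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Conservative WCET estimate: max sample + 6 × standard deviation. */
uint64_t wcet_bound_ns(const uint64_t *samples, size_t n) {
    if (n == 0) return 0;

    uint64_t max = 0, sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] > max) max = samples[i];
    }
    uint64_t mean = sum / n;

    /* Sum of squared deviations; assumes it fits in 64 bits for ns-scale samples. */
    uint64_t ss = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t d = (samples[i] > mean) ? samples[i] - mean : mean - samples[i];
        ss += d * d;
    }
    uint64_t stddev = isqrt_u64(ss / n);

    return max + 6 * stddev;
}
```

An integer square root keeps the standard-deviation step deterministic across platforms, which is the same reason the harness avoids floating point in its own measurement path.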
## Platform Detection The harness automatically detects: - **Architecture**: x86_64, aarch64, riscv64 - **CPU model**: From `/proc/cpuinfo` - **Frequency**: From sysfs, with stability monitoring - **Hardware counters**: Via `perf_event` (cycles, instructions, cache misses, branch mispredictions) - **Environment**: Temperature, throttle events A stability check flags results if CPU frequency drifts more than 5% during the benchmark run. ## Result Binding Each benchmark result is cryptographically bound to its context: ``` H(output_hash || platform || config || stats || timestamp) ``` This creates a tamper-evident record. If someone claims "we achieved X latency on platform Y", the hash can be verified against the full result data. ## Usage ```c #include "cb_runner.h" #include "cb_report.h" cb_result_code_t my_inference(void *ctx, const void *in, void *out) { // Your neural network inference return CB_OK; } int main(void) { cb_config_t config; cb_result_t result; cb_config_init(&config); config.warmup_iterations = 100; config.measure_iterations = 1000; cb_run_benchmark(&config, my_inference, model_ctx, input_data, input_size, output_data, output_size, &result); cb_print_summary(&result); cb_write_json(&result, "benchmark_x86.json"); return 0; } ``` ## Pipeline Context certifiable-bench sits between certifiable-inference (the deterministic ML engine) and certifiable-harness (the end-to-end verification system): ``` certifiable-inference ──→ certifiable-bench ──→ certifiable-harness ↑ │ └───────────────────────┘ Performance data ``` The harness runs the inference engine with benchmark instrumentation, produces JSON reports, and the harness consumes these for cross-platform comparison. Model bundles from certifiable-deploy can include baseline benchmark data for regression testing. ## Why This Matters For aerospace (DO-178C), medical devices (IEC 62304), and automotive (ISO 26262), timing data is part of the certification evidence package. Section 6.3.4 of DO-178C specifically requires "Software Timing and Sizing Data" as a verification output. But timing data without correctness verification is incomplete evidence. certifiable-bench provides both: proof of determinism and measurement of performance. The bit-identity gate ensures you never accidentally compare apples to oranges. If x86 and RISC-V produce different outputs, you find out immediately, before drawing conclusions about relative performance. ## Getting Started ```bash git clone https://github.com/SpeyTech/certifiable-bench cd certifiable-bench mkdir build && cd build cmake .. -DCMAKE_BUILD_TYPE=Release make ctest --output-on-failure ``` ## Documentation The repository includes formal documentation: - **CB-MATH-001**: Mathematical foundations (statistics, verification, comparison algorithms) - **CB-STRUCT-001**: Data structure specifications - **SRS-001 through SRS-006**: Requirements documents with ~233 SHALL statements ## Current Status The harness is feature-complete and ready for integration testing: - ✅ All statistics and verification tests pass - ✅ JSON/CSV reporting working - ✅ Cross-platform comparison implemented - ⏳ Bit-identity verification on RISC-V pending hardware access - ⏳ CI regression detection pending infrastructure setup As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. 
--- ## certifiable-harness **Repository**: https://github.com/williamofai/certifiable-harness **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-harness/ End-to-end test harness for deterministic ML — because 'it works on my machine' isn't certifiable How do you prove that your ML pipeline produces identical results on different hardware? Not "similar" results. Not "statistically equivalent" results. *Identical* — bit for bit, hash for hash. I ran the certifiable-harness on Linux with GCC and macOS with Clang. Different operating system. Different compiler. Seven pipeline stages. Same SHA-256 hashes. Every stage. Every time. **certifiable-harness** orchestrates all seven stages of the certifiable-* pipeline, captures cryptographic commitments, and compares them against a golden reference. [View on GitHub](https://github.com/williamofai/certifiable-harness) ## The Problem Traditional ML frameworks don't even try for cross-platform determinism. Floating-point rounding varies by CPU. Memory allocation affects hash table iteration order. Thread scheduling is inherently non-deterministic. For most applications, that's fine. A 0.001% difference in model output doesn't matter. For safety-critical systems, it's a fundamental problem. If you can't prove that the model running on deployed hardware is *exactly* the same as what you tested, you can't certify it. The certifiable-* ecosystem solves this through fixed-point arithmetic, static allocation, and deterministic algorithms throughout. But how do you *prove* it works? You run the harness. ## How It Works ```bash $ ./certifiable-harness --golden result.golden --output verify.json ═══════════════════════════════════════════════════════════════ Certifiable Harness v1.0.0 Platform: x86_64 ═══════════════════════════════════════════════════════════════ [0] data ✓ (OK, 4 µs) [1] training ✓ (OK, 3 µs) [2] quant ✓ (OK, 3 µs) [3] deploy ✓ (OK, 3 µs) [4] inference ✓ (OK, 3 µs) [5] monitor ✓ (OK, 4 µs) [6] verify ✓ (OK, 8 µs) Status: ALL STAGES PASSED ✓ Bit-identical: YES ✓ ═══════════════════════════════════════════════════════════════ ``` If any stage produces a different hash, the harness tells you exactly which one diverged. No more "it works on my machine" — either the hashes match or they don't. ## The Golden Reference The harness generates a 368-byte golden reference file containing: | Offset | Size | Field | |--------|------|-------| | 0x00 | 4 | Magic ("CHGR") | | 0x04 | 4 | Version | | 0x08 | 32 | Platform string | | 0x28 | 8 | Timestamp | | 0x30 | 32 | Config hash | | 0x50 | 32 | Harness hash | | 0x70 | 224 | Stage commitments (7 × 32) | | 0x150 | 32 | File hash | The file hash covers bytes 0x00–0x14F, enabling tamper detection. If anyone modifies the golden reference, verification fails. 
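To make the layout concrete, here is a minimal sketch of how the 368-byte record above could be declared and sanity-checked in C. The struct, its field names, `ch_golden_check`, and the external `sha256()` helper are assumptions for illustration; the authoritative format is defined by the repository's SRS-GOLDEN document.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CH_GOLDEN_MAGIC  "CHGR"
#define CH_STAGE_COUNT   7

/* Byte-for-byte mirror of the golden reference layout above (368 bytes). */
typedef struct {
    uint8_t magic[4];                          /* 0x00: "CHGR"                       */
    uint8_t version[4];                        /* 0x04: format version               */
    uint8_t platform[32];                      /* 0x08: platform string              */
    uint8_t timestamp[8];                      /* 0x28: generation time              */
    uint8_t config_hash[32];                   /* 0x30: SHA-256 of configuration     */
    uint8_t harness_hash[32];                  /* 0x50: SHA-256 of harness binary    */
    uint8_t stage_commit[CH_STAGE_COUNT][32];  /* 0x70: 7 × 32-byte stage commitments */
    uint8_t file_hash[32];                     /* 0x150: SHA-256 of bytes 0x00-0x14F */
} ch_golden_t;

/* SHA-256 helper assumed to be provided by the build. */
void sha256(const uint8_t *data, size_t len, uint8_t digest[32]);

/* Basic integrity checks before comparing stage commitments. */
int ch_golden_check(const uint8_t *buf, size_t len) {
    if (len != sizeof(ch_golden_t)) return -1;            /* must be exactly 368 bytes */
    if (memcmp(buf, CH_GOLDEN_MAGIC, 4) != 0) return -1;  /* magic mismatch            */

    uint8_t digest[32];
    sha256(buf, 0x150, digest);                 /* file hash covers bytes 0x00-0x14F */
    return (memcmp(digest, buf + 0x150, 32) == 0) ? 0 : -1;
}
```

Because every field is a byte array, the struct has no padding and maps directly onto the on-disk file.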
### Generating a Golden Reference ```bash ./certifiable-harness --generate-golden --output result.json ``` This produces: - `result.json` — Human-readable report - `result.json.golden` — 368-byte binary for cross-platform comparison ### Verifying Against Golden Copy the golden file to another platform and run: ```bash ./certifiable-harness --golden result.json.golden --output their_result.json ``` If the hashes match: `Bit-identical: YES ✓` ## Verified Platforms | Platform | OS | Compiler | Result | |----------|-----|----------|--------| | x86_64 | Linux (Ubuntu) | GCC 12.2.0 | ✓ Bit-identical | | x86_64 | macOS 11.7 | Apple Clang | ✓ Bit-identical | | aarch64 | — | — | Pending | | riscv64 | — | — | Pending | The Linux and macOS results were generated on different machines, different operating systems, different compilers. Same hashes. ## Seven Pipeline Stages Each stage corresponds to a certifiable-* project: | Stage | Project | Commitment | |-------|---------|------------| | 0 | certifiable-data | Merkle root of batches | | 1 | certifiable-training | Training chain hash | | 2 | certifiable-quant | Quantization certificate | | 3 | certifiable-deploy | Attestation root | | 4 | certifiable-inference | Predictions hash | | 5 | certifiable-monitor | Ledger digest | | 6 | certifiable-verify | Report hash | The harness runs them in sequence, passing context between stages. Each stage's commitment includes the previous stage's commitment, forming an unbroken cryptographic chain. ## Test Coverage | Component | Tests | Description | |-----------|-------|-------------| | Harness | 4 | Orchestration, config, platform detection | | Golden | 3 | Load, save, compare, integrity | | Stages | 4 | Stage wrappers, dependency management | | Report | 2 | JSON generation, console output | 4 test suites, all passing. 81 traceable requirements across 4 SRS documents. ## Part of a Complete Pipeline certifiable-harness orchestrates the entire certifiable-* ecosystem: ``` data → training → quant → deploy → inference → monitor → verify ↑ ↓ └──────────────── certifiable-harness ────────────────────┘ ``` It's the proving ground — the place where cross-platform determinism is verified, not assumed. ## When You Need This **Hardware Vendors**: If you're building AI accelerators — RISC-V, custom silicon, FPGAs — certifiable-harness lets you prove your hardware produces bit-identical results to reference implementations. **Certification Bodies**: The harness produces machine-verifiable evidence. No manual inspection required. Run the harness, check the hashes, file the report. **Safety-Critical Deployments**: When a regulator asks "how do you know the deployed model is the same as what you tested?", you have a 368-byte answer. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-harness.git cd certifiable-harness mkdir build && cd build cmake .. 
make ctest --output-on-failure # Generate golden reference ./certifiable-harness --generate-golden --output result.json # Verify (should show Bit-identical: YES) ./certifiable-harness --golden result.json.golden --output verify.json ``` ## Documentation The repository includes formal documentation suitable for certification evidence: - **CH-MATH-001.md** — Mathematical specification (18KB) - **CH-STRUCT-001.md** — Data structure specification - **SRS-HARNESS** — Harness orchestration (16 requirements) - **SRS-GOLDEN** — Golden reference (23 requirements) - **SRS-STAGES** — Stage wrappers (25 requirements) - **SRS-REPORT** — Report generation (17 requirements) ## License Dual licensed under GPLv3 (open source) and commercial terms for proprietary safety-critical systems. The implementation builds on the [Murray Deterministic Computing Platform](/mdcp/) (UK Patent GB2521625.0). --- For teams building safety-critical ML systems, certifiable-harness provides the proving ground that certification demands. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. [View on GitHub](https://github.com/williamofai/certifiable-harness) · [Request Technical Brief](/contact/) --- ## certifiable-verify **Repository**: https://github.com/williamofai/certifiable-verify **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-verify/ Pipeline verification for the certifiable-* ecosystem — because 'we checked it manually' isn't certifiable When a regulator asks "how do you know the deployed model matches what was trained?", what's your answer? If it involves spreadsheets, manual checksums, or "we have a process", you have a problem. Not because those approaches don't work — but because they don't *prove* anything in a way that can be independently verified. **certifiable-verify** validates the complete provenance chain through cryptographic binding verification. [View on GitHub](https://github.com/williamofai/certifiable-verify) ## The Problem The certifiable-* ecosystem produces cryptographic commitments at every stage: - **certifiable-data** produces a Merkle root of training batches - **certifiable-training** produces a gradient chain hash - **certifiable-quant** produces an error certificate - **certifiable-deploy** produces an attestation tree - **certifiable-inference** produces a predictions hash - **certifiable-monitor** produces a ledger digest But commitments alone aren't enough. You need to verify that each commitment *binds correctly* to the previous stage. That the training hash actually includes the data Merkle root. That the deployment bundle contains the quantized model that was certified. That the chain is unbroken from input data to deployed inference. That's what certifiable-verify does. ## Two Verification Modes **Hash-Only Mode** — Fast verification using pre-computed commitments: ```c cv_config_t config; cv_config_default(&config); config.mode = CV_MODE_HASH_ONLY; cv_artifacts_t artifacts; /* Load commitments from each stage... 
*/ cv_report_t report; cv_fault_flags_t faults = {0}; cv_result_t rc = cv_verify(&config, &artifacts, &report, &faults); if (report.pipeline_valid) { printf("Pipeline verified ✓\n"); } ``` **Full Replay Mode** — Complete re-execution for maximum assurance: ```c config.mode = CV_MODE_FULL_REPLAY; config.data_path = "training_data.csv"; config.model_path = "model.cbf"; /* Re-runs entire pipeline, compares against recorded commitments */ cv_result_t rc = cv_verify(&config, &artifacts, &report, &faults); ``` ## Six Cross-Artifact Bindings certifiable-verify validates six cryptographic bindings between pipeline stages: | Binding | From | To | What's Verified | |---------|------|----|-----------------| | 0 | data | training | Training includes data Merkle root | | 1 | training | quant | Quantization includes model hash | | 2 | quant | deploy | Bundle includes certificate hash | | 3 | deploy | inference | Inference loads correct bundle | | 4 | inference | monitor | Monitor records prediction hash | | 5 | deploy | monitor | Monitor binds to bundle root | Each binding is verified independently. If binding 2 fails but others pass, you know exactly where the chain broke. ```c cv_binding_result_t binding; cv_verify_binding(CV_BINDING_QUANT_DEPLOY, &artifacts, &binding, &faults); if (!binding.valid) { printf("Bundle doesn't match quantization certificate\n"); printf("Expected: %s\n", binding.expected_hex); printf("Found: %s\n", binding.actual_hex); } ``` ## Report Generation certifiable-verify produces machine-readable reports with self-integrity hashes: ```c cv_report_write_json(&report, "verification_report.json", &faults); ``` ```json { "version": "1.0.0", "timestamp": "2026-01-19T21:00:00Z", "pipeline_valid": true, "bindings": [ {"id": 0, "name": "data→training", "valid": true}, {"id": 1, "name": "training→quant", "valid": true}, {"id": 2, "name": "quant→deploy", "valid": true}, {"id": 3, "name": "deploy→inference", "valid": true}, {"id": 4, "name": "inference→monitor", "valid": true}, {"id": 5, "name": "deploy→monitor", "valid": true} ], "report_hash": "a3f2c8..." } ``` The `report_hash` covers the entire report content, enabling tamper detection. ## Test Coverage | Test Suite | Coverage | |------------|----------| | test_provenance | Data lineage verification | | test_training | Training chain validation | | test_quant | Certificate binding | | test_bundle | Bundle attestation | | test_inference | Prediction hash verification | | test_ledger | Audit trail integrity | | test_binding | Cross-artifact bindings | | test_report | Report generation | | test_hash | Hash utilities | | test_serialize | Serialization | All 10 test suites passing. ## Part of a Complete Pipeline certifiable-verify is stage 6 in the certifiable-* ecosystem: ``` data → training → quant → deploy → inference → monitor → verify ``` It's the final stage — the one that answers "is this pipeline intact?" ## When You Need This **Aerospace (DO-178C)**: Requires traceability from requirements through implementation to test. certifiable-verify extends this to the ML pipeline itself — proving that the deployed model traces back to specific training data through an unbroken chain of cryptographic commitments. **Medical Devices (IEC 62304)**: Class C software requires rigorous verification. certifiable-verify provides automated, repeatable verification that can be run before every deployment — not just during initial certification. **Automotive (ISO 26262)**: ASIL-D requires evidence that safety-critical software behaves as specified. 
certifiable-verify produces that evidence in a format that's machine-verifiable and auditable. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-verify.git cd certifiable-verify mkdir build && cd build cmake .. make ctest --output-on-failure ``` ## Documentation The repository includes formal documentation suitable for certification evidence: - **CV-MATH-001.md** — Mathematical specification (47KB) - **CV-STRUCT-001.md** — Data structure specification - **SRS-001 through SRS-008** — Software Requirements Specifications ## License Dual licensed under GPLv3 (open source) and commercial terms for proprietary safety-critical systems. The implementation builds on the [Murray Deterministic Computing Platform](/mdcp/) (UK Patent GB2521625.0). --- For teams building safety-critical ML systems, certifiable-verify provides the formal rigour that certification demands. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. [View on GitHub](https://github.com/williamofai/certifiable-verify) · [Request Technical Brief](/contact/) --- ## certifiable-monitor **Repository**: https://github.com/williamofai/certifiable-monitor **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-monitor/ Deterministic runtime monitoring — because 'the model drifted' isn't certifiable Deploying an ML model isn't the end of the story. In safety-critical systems, you need to know — with cryptographic certainty — when that model is operating outside its certified envelope. Standard monitoring tools use floating-point statistics, produce non-deterministic results, and leave no verifiable audit trail. certifiable-monitor changes that. [View on GitHub](https://github.com/williamofai/certifiable-monitor) ## The Problem When an ML model runs in production, things drift: - **Input distributions shift** — Real-world data diverges from training - **Activations exceed bounds** — Internal values go where they shouldn't - **Output patterns change** — Predictions no longer match expectations - **Faults accumulate silently** — Overflow and saturation events go unnoticed Current monitoring approaches have fundamental problems for certification: **Non-deterministic metrics.** Floating-point drift calculations produce different results on different platforms. How do you validate something that changes? **No audit trail.** When an incident occurs, there's no cryptographic proof of what the monitor observed. It's your word against the logs. **Ambiguous reactions.** "Log a warning" isn't a deterministic specification. What action, exactly, should the system take when TV exceeds 0.15? For DO-178C Level A, IEC 62304 Class C, or ISO 26262 ASIL-D certification, "the model drifted so we logged it" isn't acceptable evidence. ## The Solution certifiable-monitor provides deterministic runtime monitoring through three core mechanisms: ### 1. Fixed-Point Drift Detection All statistical metrics computed in fixed-point arithmetic: **Total Variation (TV)** — The safest detector, no logarithms required: ``` TV(p, q) = (1/2) Σ_b |p_b - q_b| ``` Output in Q0.32. Zero means identical distributions. UINT32_MAX means completely disjoint. **Jensen-Shannon Divergence (JSD)** — Symmetric divergence measure: ``` JSD(p, q) = (1/2) KL(p ∥ m) + (1/2) KL(q ∥ m) ``` Uses a 512-entry LUT for log2 computation. No floating-point. Bit-identical on x86, ARM, and RISC-V. 
**Population Stability Index (PSI)** — Directional sensitivity: ``` PSI(p, q) = Σ_b (p_b - q_b) ln(p_b / q_b) ``` Epsilon smoothing prevents log(0). Policy defines operational thresholds. Same inputs produce the same drift scores. Every time. Every platform. ### 2. Cryptographic Audit Ledger Every monitoring event is logged to a SHA-256 hash chain: ``` L_0 = SHA256("CM:LEDGER:GENESIS:v1" ∥ R ∥ H_P) L_t = SHA256("CM:LEDGER:v1" ∥ L_{t-1} ∥ e_t) ``` The genesis block binds to the deployment bundle root `R` and policy hash `H_P`. Every subsequent entry chains to the previous digest. Tampering is detectable. Truncation is detectable. Reordering is detectable. Post-incident analysis can replay the entire monitoring history with cryptographic verification. ### 3. Deterministic Health FSM A state machine with formally defined transitions: ``` UNINIT → INIT → ENABLED → ALARM → DEGRADED → STOPPED ``` Fault budgets define thresholds. Violations trigger transitions. Once stopped, only manual intervention restarts. No ambiguity about system state. ## What's Implemented 253 Tests Passing 11 Test Suites ~13,700 Lines of Code | Module | Purpose | Tests | |--------|---------|-------| | DVM Primitives | Saturating arithmetic, LUT log2 | 33 | | Audit Ledger | SHA-256 hash chain | 18 | | Drift Detectors | TV, JSD, PSI computation | 20 | | Policy Parser | COE JSON parsing, JCS hash | 25 | | Input Monitor | Feature envelope checking | 22 | | Activation Monitor | Layer bounds checking | 24 | | Output Monitor | Output envelope checking | 19 | | Health FSM | Monitor state machine | 19 | | Reaction Handler | Violation → action mapping | 14 | | Ledger Verification | Offline chain verification | 32 | | Bit-Identity | Cross-platform determinism | 27 | Every module traces to formal specifications in CM-MATH-001, CM-STRUCT-001, and the SRS documents. ## Usage Example ```c #include "policy.h" #include "input.h" #include "health.h" #include "ledger.h" #include "react.h" ct_fault_flags_t faults = {0}; // Load policy and initialize ledger cm_policy_t policy; cm_policy_parse(policy_json, policy_len, &policy, &faults); cm_ledger_ctx_t ledger; cm_ledger_init(&ledger); cm_ledger_genesis(&ledger, policy.bundle_root, policy.policy_hash, &faults); // Initialize monitors cm_input_ctx_t input_mon; cm_input_init(&input_mon, &policy.input); cm_health_ctx_t health; cm_health_init(&health, &policy.fault_budget); cm_health_enable(&health); // Per-inference: check input envelope cm_input_result_t result; cm_input_check(&input_mon, input_vector, num_features, &result, &faults); if (result.violations > 0) { // Log to cryptographic ledger uint8_t L_out[32]; cm_ledger_append_violation(&ledger, window_id, CM_VIOL_INPUT_RANGE, result.first_violation_idx, result.first_violation_value, result.first_violation_bound, L_out, &faults); // Get policy-defined reaction cm_reaction_t action = cm_policy_get_reaction(&policy, CM_VIOL_INPUT_RANGE); // Update health state cm_health_report_violation(&health, CM_VIOL_INPUT_RANGE); } // Check if system should halt if (cm_health_get_state(&health) == CM_HEALTH_STOPPED) { // Emergency stop — do not proceed with inference } ``` All buffers statically allocated. No malloc. Deterministic execution path. 
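As an illustration of the fixed-point drift detection described above, here is a minimal sketch of the Total Variation metric using only integer arithmetic. The function name and the Q0.32 input convention are assumptions; the project's own detectors are specified in CM-MATH-001.

```c
#include <stdint.h>
#include <stddef.h>

/* Total Variation distance between two binned distributions.
 * Inputs: bin probabilities in Q0.32 (probability × 2^32).
 * Output: Q0.32 — 0 means identical, UINT32_MAX means (almost) disjoint. */
uint32_t tv_distance_q32(const uint32_t *p, const uint32_t *q, size_t bins) {
    uint64_t acc = 0;
    for (size_t b = 0; b < bins; b++) {
        uint64_t pb = p[b], qb = q[b];
        acc += (pb > qb) ? (pb - qb) : (qb - pb);   /* Σ |p_b - q_b| */
    }
    acc >>= 1;                                      /* × 1/2          */
    return (acc > UINT32_MAX) ? UINT32_MAX : (uint32_t)acc;
}
```

Because the computation involves only integer additions, subtractions, and a shift, the same histograms produce the same Q0.32 score on x86, ARM, and RISC-V.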
## The Pipeline certifiable-monitor completes the deterministic ML ecosystem: ``` certifiable-data → certifiable-training → certifiable-quant → certifiable-deploy → certifiable-inference ↓ certifiable-monitor ↓ Audit Ledger ``` The monitor receives: - **From certifiable-deploy:** Bundle attestation root and policy hash - **From certifiable-inference:** Input vectors, activation values, output vectors, fault flags - **From policy:** Thresholds, envelopes, reaction mappings Six interlocking projects. One coherent vision: deterministic ML from data to monitored production. ## Why This Matters ### Medical Devices IEC 62304 Class C requires traceable, reproducible software. When a diagnostic AI flags an anomaly, the response must be deterministic. The audit trail must be verifiable. ### Autonomous Vehicles ISO 26262 ASIL-D demands provable behavior under all conditions. Input drift detection with cryptographic proof isn't optional — it's the difference between "we think the model was stable" and "here's the hash chain proving it." ### Aerospace DO-178C Level A requires complete requirements traceability. Every drift metric traces to CM-MATH-001. Every state transition traces to CM-ARCH-MATH-001. Every test traces to an SRS requirement. This is the monitoring layer that makes ML certification possible. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-monitor cd certifiable-monitor mkdir build && cd build cmake .. make make test-all # 253 tests ``` Expected output: ``` 100% tests passed, 0 tests failed out of 11 ``` ## Documentation The implementation traces to formal specifications: - **CM-MATH-001** — Mathematical foundations (drift metrics, ledger hashing, log2 LUT) - **CM-STRUCT-001** — Data structure specifications - **CM-ARCH-MATH-001** — Architecture-level math (health FSM, window semantics) - **SRS-001 through SRS-008** — Module requirements with full traceability Every function documents its traceability reference. Every test validates a specification clause. ## The Trade-Off Deterministic monitoring isn't free. Fixed-point arithmetic requires careful scaling. Hash chain updates add overhead. Static allocation means pre-sized buffers. For systems where "it probably works" is acceptable, standard monitoring tools are simpler. For systems where lives depend on the answer — where regulators demand proof, where post-incident analysis requires cryptographic verification, where "the model drifted" needs to be a traceable, reproducible, auditable event — certifiable-monitor provides the foundation. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. --- *Built by SpeyTech in the Scottish Highlands. 30 years of UNIX systems engineering applied to making ML safe enough to certify.* [View on GitHub](https://github.com/williamofai/certifiable-monitor) · [Documentation](https://github.com/williamofai/certifiable-monitor/tree/main/docs) --- ## Certifiable Deploy **Repository**: https://github.com/williamofai/certifiable-deploy **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-deploy/ Deterministic model packaging and cryptographic attestation — because 'trust me, it's the right model' isn't certifiable You've trained a model deterministically. You've quantized it with formal error bounds. Now you need to deploy it to production hardware. How do you prove the deployed model matches what was certified? How do you verify weights haven't been tampered with? 
How do you maintain cryptographic provenance from training to deployment? For safety-critical systems, "trust me, it's the right model" is not certifiable. **certifiable-deploy** implements the "Execution ⇒ Verification" invariant: the inference API is enabled only after measured hashes of weights and kernels match the certificate claims and attestation root. [View on GitHub](https://github.com/williamofai/certifiable-deploy) ## The Problem Deploying ML models to safety-critical systems faces fundamental challenges: 1. **Provenance**: How do you prove the deployed model matches what was certified? 2. **Integrity**: How do you verify weights haven't been tampered with in transit or at rest? 3. **Binding**: How do you ensure the model runs only on approved hardware? 4. **Auditability**: How do you maintain cryptographic proof linking deployment back to training? Standard deployment pipelines—containers, model registries, package managers—assume trust. They provide checksums for convenience, not for certification. When a regulator asks "prove this is the model you certified," a Docker image hash doesn't satisfy DO-178C. ## The Solution certifiable-deploy provides five core components that work together to create a verifiable deployment chain. ### 1. Canonical Bundle Format (CBF v1) A deterministic container with no ambient metadata. Payloads, table of contents, and attestation in a single verifiable package: ``` ┌─────────────────────────────────────┐ │ Global Header │ │ magic(4) | version(4) | offsets │ ├─────────────────────────────────────┤ │ File Payloads │ │ (raw bytes, no metadata) │ ├─────────────────────────────────────┤ │ Table of Contents │ │ entry_count | entries[] │ │ (sorted by normalized path) │ ├─────────────────────────────────────┤ │ Footer │ │ merkle_root | signature | magic │ └─────────────────────────────────────┘ ``` No timestamps. No filesystem metadata. No ambient authority. The same content produces the same bundle produces the same hash, forever. ### 2. Merkle Attestation A 4-leaf Merkle tree binds manifest, weights, certificates, and inference artifacts into a single attestation root: ``` R (root) / \ R₁ R₂ / \ / \ L_M L_W L_C L_I ``` Where: - `L_M` = domain hash of manifest - `L_W` = domain hash of weights - `L_C` = domain hash of certificate chain - `L_I` = domain hash of inference kernels The root `R` is what gets signed. Tampering with any leaf changes the root. ### 3. JCS Manifest (RFC 8785) The manifest uses JSON Canonicalization Scheme for deterministic serialization. Same content = same bytes = same hash: ```c cdm_builder_t mb; cdm_builder_init(&mb); cdm_set_mode(&mb, "deterministic"); cdm_set_created_at(&mb, 0); // No timestamps cdm_set_target(&mb, &target); cdm_set_weights_hash(&mb, &h_weights); cdm_set_certs_hash(&mb, &h_certs); cdm_set_inference_hash(&mb, &h_inference); uint8_t manifest_json[4096]; size_t manifest_len = sizeof(manifest_json); cdm_finalize_jcs(&mb, manifest_json, &manifest_len); ``` JCS ensures key ordering and numeric representation are canonical. No "semantically equivalent but different bytes" ambiguity. ### 4. Target Binding Lock bundles to specific platforms with target tuples: ``` arch-vendor-device-abi ``` Examples: - `riscv64-tenstorrent-p150-lp64d` - `x86_64-intel-xeon-sysv` - `aarch64-nvidia-orin-lp64` Wildcards (`generic`) allow bundles to match multiple devices while maintaining architecture/ABI safety. A bundle built for `x86_64-generic-cpu-sysv` runs on any x86_64 SYSV system. 
A bundle built for `riscv64-tenstorrent-p150-lp64d` runs only on that specific accelerator. ### 5. Runtime Loader (CD-LOAD) The loader implements a fail-closed state machine. JIT hash verification at every stage: ``` INIT → HEADER_READ → TOC_READ → MANIFEST_VERIFIED → WEIGHTS_STREAMING → WEIGHTS_VERIFIED → INFERENCE_STREAMING → INFERENCE_VERIFIED → CHAIN_VERIFIED → ENABLED Any State --[error]--> FAILED (terminal) ``` Any verification failure immediately transitions to `FAILED`, which cannot be exited. The inference API is unreachable without completing the full verification chain. ```c cd_load_ctx_t ctx; cd_target_t device_target; // Set device target cdt_set(&device_target, CD_ARCH_X86_64, "intel", "xeon", CD_ABI_SYSV); // Initialize loader cdl_init(&ctx, &device_target); // Open bundle (verifies header, TOC, manifest, target) cdl_open_bundle(&ctx, bundle_data, bundle_len); // Load weights with JIT hash verification cdl_load_weights(&ctx, weights, weights_size); // Load inference kernels with JIT hash verification cdl_load_kernels(&ctx, kernels, kernel_size); // Finalize (verifies Merkle root) cdl_finalize(&ctx); // Only now is execution permitted if (cdl_is_enabled(&ctx)) { run_inference(weights, kernels); } ``` ## Domain-Separated Hashing All hashes use domain separation to prevent cross-protocol attacks: ``` DH(tag, payload) = SHA256(tag || LE64(|payload|) || payload) ``` Domain tags include: - `CD:MANIFEST:v1` — Manifest hash - `CD:WEIGHTS:v1` — Weights hash - `CD:CERTSET:v1` — Certificate chain hash - `CD:INFERSET:v1` — Inference set hash - `CD:LEAF:*:v1` — Merkle leaf hashes - `CD:MERKLENODE:v1` — Merkle internal nodes A weights file cannot be substituted for a manifest, even if they happen to have the same content. The domain tag makes them cryptographically distinct. 201 Tests 7/7 Test Suites CBF v1 Format RFC 8785 JCS ## What's Implemented All modules complete — 201 tests passing across 7 test suites: | Module | Tests | Coverage | |--------|-------|----------| | Audit | 18 | SHA-256, domain-separated hashing | | Attest | 18 | Merkle tree construction, attestation | | Bundle | 37 | CBF v1 builder, reader, format compliance | | Manifest | 57 | JCS canonicalization, parsing, roundtrip | | Target | 32 | Parse, encode, match, wildcard | | Verify | 23 | Offline bundle verification | | Loader | 16 | Runtime JIT verification, state machine | ## The Complete Pipeline certifiable-deploy bridges the gap between quantization and inference: ``` certifiable-data → certifiable-training → certifiable-quant → certifiable-deploy → certifiable-inference ↓ ↓ ↓ ↓ ↓ Load data Train model Quantize model Package bundle Run inference Normalise Merkle chain Error bounds Attestation Bit-perfect Shuffle Audit trail Certificate Target binding execution ``` Each project maintains the same principles: pure C99, zero dynamic allocation, formal verification patterns, and cryptographic audit trails. ## Why It Matters ### Medical Devices IEC 62304 Class C requires traceable, reproducible software. When a pacemaker's ML model makes a decision, you need to prove it's running exactly what was validated. certifiable-deploy provides cryptographic evidence linking the deployed binary to the certified source. ### Autonomous Vehicles ISO 26262 ASIL-D demands provable behaviour. The perception model that passed simulation must be bit-identical to the perception model in the vehicle. Target binding ensures the certified bundle only runs on approved hardware configurations. 
### Aerospace DO-178C Level A requires complete requirements traceability. "We deployed the model" satisfies no one. The Merkle attestation provides cryptographic proof linking each deployment artifact back through quantization, training, and data preparation. ## Integration Points certifiable-deploy is designed for integration with existing security infrastructure: **Ed25519 Signing**: The `cda_sign()` function provides the interface for signing attestation roots. Integrators provide their own Ed25519 implementation appropriate for their security requirements (HSM, libsodium, certified library). **Certificate Chain**: Certificate parsing requires integration with the deployer's PKI infrastructure. The certificate format is defined by the upstream certifiable-* pipeline. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-deploy cd certifiable-deploy mkdir build && cd build cmake .. make make test-all ``` Expected output: ``` 100% tests passed, 0 tests failed out of 7 Total Test time (real) = 0.02 sec ``` ## Documentation The repository includes formal documentation suitable for certification evidence: - **CD-MATH-001.md** — Mathematical foundations - **CD-STRUCT-001.md** — Data structure specifications - **SRS documents** — Software Requirements Specifications: - SRS-001-BUNDLE — CBF v1 format - SRS-002-ATTEST — Merkle attestation - SRS-003-TARGET — Target binding - SRS-004-MANIFEST — JCS canonicalization - SRS-005-VERIFY — Offline verification - SRS-006-LOADER — Runtime loader ## License Dual licensed under GPLv3 (open source) and commercial terms for proprietary safety-critical systems. The implementation builds on the [Murray Deterministic Computing Platform](/mdcp/) (UK Patent GB2521625.0). --- For teams deploying ML to safety-critical systems, certifiable-deploy provides the cryptographic rigour that certification demands. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. [View on GitHub](https://github.com/williamofai/certifiable-deploy) · [Request Technical Brief](/contact/) --- ## Certifiable Quant **Repository**: https://github.com/williamofai/certifiable-quant **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-quant/ Deterministic model quantization with formal error certificates for safety-critical ML Standard quantization tools treat model compression as an optimisation problem: shrink the weights, benchmark accuracy, ship it. For most applications, that's fine. For safety-critical systems—where a misclassified obstacle or miscalculated dosage can cost lives—"approximately equivalent" isn't certifiable. **certifiable-quant** provides provable FP32→Q16.16 quantization with formal error bounds and cryptographic audit trails. [View on GitHub](https://github.com/williamofai/certifiable-quant) ## The Problem with Black-Box Quantization When you run a neural network through TensorFlow Lite or an ONNX quantizer, you get a smaller model. What you don't get: - **Error bounds**: How far can the quantized output deviate from the original? - **Reproducibility**: Will the same input produce identical output on different hardware? - **Audit trail**: Can you prove, years later, which original model produced this quantized version? For DO-178C Level A aerospace software, IEC 62304 Class C medical devices, or ISO 26262 ASIL-D automotive systems, these aren't nice-to-haves. They're certification requirements. 
## How certifiable-quant Works The pipeline has five stages, each producing a cryptographic digest that feeds into a final Merkle-rooted certificate. ### 1. Analyze Before touching any weights, compute theoretical error bounds: ```c cq_analysis_ctx_t analysis; cq_analysis_init(&analysis, num_layers); cq_compute_weight_range(weights, n, &w_min, &w_max); cq_compute_overflow_proof(w_max, x_max, n, &overflow_proof); cq_analysis_digest_generate(&analysis, &analysis_digest); ``` This establishes the entry error (ε₀ = 2⁻¹⁷ for Q16.16) and computes Lipschitz constants for layer-by-layer error propagation. ### 2. Calibrate Collect runtime statistics from representative data: ```c cq_tensor_stats_t stats; cq_tensor_stats_init(&stats, CQ_FORMAT_Q16); for (int i = 0; i < num_samples; i++) { cq_tensor_stats_update(&stats, activations[i], size); } cq_calibration_digest_generate(&calib, &calib_digest); ``` The calibration module tracks min/max ranges, coverage percentages, and flags degenerate distributions that might indicate insufficient calibration data. ### 3. Convert Quantize with explicit fault detection: ```c cq_convert_ctx_t convert; cq_convert_init(&convert, CQ_FORMAT_Q16); cq_quantize_tensor(fp32_weights, q16_weights, n, &convert, &faults); ``` All arithmetic uses widening multiplication and Round-to-Nearest-Even (RNE) rounding—no silent overflow, no undefined behaviour. ### 4. Verify Check quantized values against theoretical bounds: ```c cq_verify_ctx_t verify; cq_verify_init(&verify, &analysis); bool passed = cq_verify_tensor(q16_weights, &analysis.contracts[layer], &verify); cq_verify_digest_generate(&verify, &verify_digest); ``` If any value exceeds its computed bound, verification fails. No silent degradation. ### 5. Certificate Generate the final proof object: ```c cq_certificate_builder_t builder; cq_certificate_builder_init(&builder); cq_certificate_builder_set_analysis(&builder, &analysis_digest); cq_certificate_builder_set_calibration(&builder, &calib_digest); cq_certificate_builder_set_verification(&builder, &verify_digest); cq_certificate_t cert; cq_certificate_build(&builder, &cert, &faults); ``` The certificate contains a Merkle root linking all three digests. Years later, you can prove exactly which analysis, calibration, and verification produced this quantized model. ## Error Bound Theory Q16.16 uses 16 fractional bits, giving an entry error of: ``` ε₀ = 2^(-f-1) = 2^(-17) ≈ 7.6 × 10⁻⁶ ``` Error propagates through layers according to: ``` ε_{ℓ+1} = ρ_ℓ · ε_ℓ + δ_ℓ ``` Where ρ_ℓ is the layer's Lipschitz constant (operator norm) and δ_ℓ is the layer's quantization error. The analysis module computes these bounds before conversion begins—if the theoretical maximum error exceeds tolerance, you know before wasting compute on quantization. ## The Fault Model Every operation signals faults explicitly: ```c typedef struct { uint32_t overflow : 1; /* Saturated high */ uint32_t underflow : 1; /* Saturated low */ uint32_t div_zero : 1; /* Division by zero */ uint32_t domain : 1; /* Invalid input */ uint32_t precision : 1; /* Precision loss detected */ } ct_fault_flags_t; ``` No silent failures. If something goes wrong, the fault flags tell you what and where. 
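To make the error-bound recurrence concrete, here is a minimal analysis-time sketch that propagates the bound through the layers, starting from the Q16.16 entry error of 2⁻¹⁷. The function name and the use of `double` are assumptions for illustration (this runs on the development host, not the target); the real Analyze module is specified in CQ-MATH-001.

```c
#include <stddef.h>

/* Propagate the worst-case quantization error bound:
 *   eps_{l+1} = rho_l * eps_l + delta_l
 * where rho[l] is layer l's Lipschitz constant and delta[l] its
 * per-layer quantization error. Illustrative, host-side only. */
double propagate_error_bound(const double *rho, const double *delta, size_t layers) {
    double eps = 1.0 / (1 << 17);          /* entry error for Q16.16, about 7.6e-6 */
    for (size_t l = 0; l < layers; l++) {
        eps = rho[l] * eps + delta[l];
    }
    return eps;                             /* bound on output deviation */
}
```

If the returned bound exceeds the application's tolerance, quantization can be rejected before any weights are converted, which is exactly the "know before wasting compute" property described above.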
## Test Coverage All core modules are complete with 150 tests across 7 test suites: | Module | Tests | Coverage | |--------|-------|----------| | Analyze | 30 | Overflow proofs, range propagation, norms | | Bit Identity | 9 | RNE patterns, SHA-256 vectors, cross-platform | | Calibrate | 28 | Statistics, coverage, degenerate handling | | Certificate | 27 | Builder, Merkle, serialization, roundtrip | | Convert | 21 | RNE quantization, BatchNorm folding, dyadic | | Primitives | 13 | Arithmetic, saturation, overflow safety | | Verify | 22 | Bound checking, L∞ norm, contract validation | ## Part of a Complete Pipeline certifiable-quant is one component of a deterministic ML ecosystem: **certifiable-data** → **certifiable-training** → **certifiable-quant** → **certifiable-deploy** → **certifiable-inference** Each project maintains the same principles: pure C99, zero dynamic allocation, formal error tracking, and cryptographic audit trails. Together, they provide end-to-end traceability from training data through deployed inference. ## When You Need This **Medical devices**: IEC 62304 Class C requires traceable, validated transformations. "We quantized it and it still works" isn't evidence. **Autonomous vehicles**: ISO 26262 ASIL-D demands provable error bounds. Unbounded quantization error is a safety hazard. **Aerospace**: DO-178C Level A requires complete requirements traceability. Every weight transformation must be auditable. If your model makes decisions that affect human safety, black-box quantization creates certification risk. certifiable-quant is designed to address that gap. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-quant cd certifiable-quant mkdir build && cd build cmake .. make make test ``` Expected output: ``` 100% tests passed, 0 tests failed out of 7 Total Test time (real) = 0.02 sec ``` ## Documentation The repository includes formal documentation suitable for certification evidence: - **CQ-MATH-001.md** — Mathematical foundations and error theory - **CQ-STRUCT-001.md** — Data structure specifications - **SRS documents** — Software Requirements Specifications for each module ## License Dual licensed under GPLv3 (open source) and commercial terms for proprietary safety-critical systems. The implementation builds on the [Murray Deterministic Computing Platform](/mdcp/) (UK Patent GB2521625.0). --- For teams building safety-critical ML systems, certifiable-quant provides the formal rigour that certification demands. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. [View on GitHub](https://github.com/williamofai/certifiable-quant) · [Request Technical Brief](/contact/) --- ## C-From-Scratch **Repository**: https://github.com/williamofai/c-from-scratch **License**: MIT **Documentation**: https://speytech.com/open-source/c-from-scratch/ Learn to build safety-critical systems in C — mathematical rigour, not 'Hello World' Most programming courses start with "Hello World" and work up to toy applications. They teach syntax but not thinking. They produce programmers who can write code that compiles, but not code that can be trusted when lives depend on it. C-From-Scratch takes the opposite approach. We start with the question: **what would it take to build software for a pacemaker?** Then we work backwards to the fundamentals. ## The Methodology Core Principle "Sensors report. The Captain decides." 
Every module follows the same pattern: ``` Problem → Math Model → Proof → Structs → Code → Verification │ │ │ │ │ │ │ │ │ │ │ └── Does code match math? │ │ │ │ └── Direct transcription │ │ │ └── Every field justified │ │ └── Contracts proven │ └── State machine defined └── Failure modes analysed ``` This is the "Math → Structs → Code" methodology. We don't write code until we can prove properties about our design. The code is a transcription of the math, not an invention. ## The Safety Stack Seven modules build from simple monitors to a complete safety-critical system: 7 Modules 42 Lessons 98 Tests MIT License ### The Sensors (Modules 1-4) **Module 1: Pulse** — Is the signal alive? The simplest possible monitor. A signal either arrives within a timeout or it doesn't. Two states, deterministic transitions, provable properties. This is the foundation everything else builds on. **Module 2: Baseline** — Is the signal normal? Exponential moving averages with O(1) memory. Statistical anomaly detection without storing history. We prove bounds on memory usage and numerical stability. **Module 3: Timing** — Is the rhythm healthy? Interval analysis and jitter detection. When a heartbeat should arrive every 800ms ± 50ms, we need to detect when intervals drift outside tolerance. **Module 4: Drift** — Trending toward failure? Long-term trend detection. A sensor might be "normal" right now but drifting toward failure. Linear regression with bounded memory and deterministic computation. ### The Decision Layer (Modules 5-7) **Module 5: Consensus** — Which sensor to trust? Triple Modular Redundancy (TMR) voting. When three sensors disagree, how do you decide which one is right? Byzantine fault tolerance in 200 lines of C. **Module 6: Pressure** — Handle overflow Bounded queues with backpressure. When events arrive faster than you can process them, what do you drop? We prove the queue never overflows and never loses critical events. **Module 7: Mode** — The Captain System mode management with permission matrices. Some actions are only safe in certain modes. Transitions between modes must be controlled. This is where everything comes together. ### The Composition ``` Sensors (1-4) → Judge (5) → Buffer (6) → Captain (7) ``` Each module is independently verified. The composition preserves the properties of the components. This is how you build complex systems you can trust. ## What You'll Learn This isn't a syntax course. You'll learn: - **Failure mode analysis** — Enumerate what can go wrong before writing code - **State machine design** — Define all states and all transitions explicitly - **Property proofs** — Prove bounds, invariants, and safety properties - **Data structure justification** — Every field serves a proven purpose - **TMR voting** — Implement Byzantine fault tolerance - **Backpressure handling** — Manage overflow without losing critical data - **Mode management** — Control system behaviour through state transitions - **Compositional verification** — Build complex systems from verified components ## Example: The Pulse Monitor Here's what a lesson looks like. The Pulse monitor answers one question: is the signal alive? ### The Math ``` State: { ALIVE, DEAD } Input: { tick, heartbeat } Transition: ALIVE + tick → if (ticks_since_heartbeat > TIMEOUT) then DEAD else ALIVE ALIVE + heartbeat → ALIVE (reset counter) DEAD + heartbeat → ALIVE DEAD + tick → DEAD ``` ### The Proof Property: "If heartbeats arrive faster than TIMEOUT, the system stays ALIVE." Proof: Each heartbeat resets the counter. 
If heartbeats arrive every T ticks where T < TIMEOUT, the counter is reset to zero before it can reach TIMEOUT, so the ALIVE → DEAD transition never fires. The property holds for every possible input sequence, not just the ones we test.

### The Code

```c
pulse_state_t pulse_tick(pulse_monitor_t *pm) {
    if (pm->state == PULSE_ALIVE) {
        pm->ticks_since_heartbeat++;
        if (pm->ticks_since_heartbeat > pm->timeout) {
            pm->state = PULSE_DEAD;
        }
    }
    return pm->state;
}

pulse_state_t pulse_heartbeat(pulse_monitor_t *pm) {
    pm->ticks_since_heartbeat = 0;
    pm->state = PULSE_ALIVE;
    return pm->state;
}
```

The code is a direct transcription of the state machine. No creativity required — creativity in safety-critical code is a bug.

### The Test

```c
void test_timeout_triggers_dead(void) {
    pulse_monitor_t pm;
    pulse_init(&pm, 3);  // Timeout after 3 ticks

    assert(pulse_tick(&pm) == PULSE_ALIVE);  // tick 1
    assert(pulse_tick(&pm) == PULSE_ALIVE);  // tick 2
    assert(pulse_tick(&pm) == PULSE_ALIVE);  // tick 3
    assert(pulse_tick(&pm) == PULSE_DEAD);   // tick 4 - timeout!
}
```

The test verifies the proof. If the test passes, the code matches the math.

## Why This Approach Works

Traditional programming education teaches you to write code that works. This approach teaches you to write code that **cannot fail** — because failure modes are eliminated by design, not by testing.

Testing can only show the presence of bugs, not their absence. Proofs show absence. When you prove a property about your state machine, it holds for all possible inputs, not just the ones you thought to test.

This is how aerospace software is written. This is how medical device software is written. And now it's how you can learn to write software.

## Getting Started

```bash
git clone https://github.com/williamofai/c-from-scratch
cd c-from-scratch
make
make test
```

Start with Module 1 (Pulse). Each module builds on the previous. By Module 7, you'll have built a complete safety-critical monitoring system.

## Community Response

The course has resonated with engineers who want more than syntax:

> "Finally, a C course that treats the language seriously. The math-first approach changed how I think about all my code."

> "I've been writing C for 15 years and I learned more about safety-critical design in Module 5 than in my entire career."

The repository has seen significant engagement on GitHub, and the LinkedIn posts about the methodology have reached over 66,000 impressions. Turns out there's demand for rigour.

---

*Prove first, code second. MIT licensed.*

---

## Certifiable Data

**Repository**: https://github.com/williamofai/certifiable-data
**License**: GPL-3.0
**Documentation**: https://speytech.com/open-source/certifiable-data/

Deterministic data pipelines for safety-critical ML — because 'we shuffled the data' isn't reproducible

Standard ML data pipelines are a major source of non-determinism. Floating-point normalisation varies across platforms. Random shuffling produces different orders each run. Data augmentation introduces uncontrolled variation. When you can't reproduce your data pipeline, you can't reproduce your training. When you can't reproduce your training, you can't certify your model.

`certifiable-data` makes data loading a pure function: `B_t = Pipeline(D, seed, epoch, t)`. Given the same dataset, seed, and indices, you get the same batch — bit for bit, every time.

## The Problem

Consider a typical PyTorch data loader:

```python
loader = DataLoader(dataset, shuffle=True, num_workers=4)
```

This single line introduces multiple sources of non-determinism:

1. **Shuffle order** depends on PRNG state, which depends on when you called it
2. **Floating-point normalisation** varies by platform
3. **Worker processes** may return batches in different orders
4. **Augmentation** (if any) introduces random transformations

For research, this doesn't matter. For safety-critical systems, it's disqualifying.

## The Solution

### Deterministic Normalisation

Standard normalisation uses floating-point: `y = (x - mean) / std`. The division introduces platform-dependent rounding. `certifiable-data` uses fixed-point with precomputed inverse:

```c
// Q16.16 fixed-point normalisation
// y = (x - μ) * (1/σ)
// All operations use DVM primitives with RNE rounding
int32_t normalise(int32_t x, int32_t mean, int32_t inv_std,
                  ct_fault_flags_t *faults) {
    int64_t diff = (int64_t)x - (int64_t)mean;
    int64_t product = diff * (int64_t)inv_std;
    return dvm_round_shift_rne(product, 16, faults);
}
```

The result is deterministic because:

- All arithmetic is integer (no floating-point)
- Rounding uses Round-to-Nearest-Even (RNE), explicitly
- Overflow is handled by saturation with fault flags

### Feistel Shuffling

Standard shuffling (Fisher-Yates) requires sequential access and maintains internal state. Different execution orders produce different shuffles.

We use a Cycle-Walking Feistel network — a cryptographic permutation that maps any index to its shuffled position in O(1) time:

```c
// Feistel network with cycle-walking
// π: [0, N-1] → [0, N-1] (bijection)
// Same (seed, epoch, index) → same output, always
uint32_t permute_index(uint32_t index, uint32_t N, uint64_t seed, uint32_t epoch);
```

Test vectors from CT-MATH-001 §7.2:

```
N=100, seed=0x123456789ABCDEF0, index=0  → 26
N=100, seed=0x123456789ABCDEF0, index=99 → 41
N=60000, seed=0xFEDCBA9876543210, index=0 → 26382
```

The permutation is a true bijection — every input maps to exactly one output, and every output comes from exactly one input.

### Deterministic Augmentation

Data augmentation typically uses random number generators.
We use counter-based PRNG with explicit operation IDs:

```c
// Horizontal flip (50% probability)
uint64_t rng = ct_prng(seed, epoch, sample_idx);
if (rng & 1) {
    /* apply the horizontal flip to this sample */
}
```

The decision depends only on (seed, epoch, sample_idx) — the same sample gets the same augmentation in every run, on every platform.

142 Tests 8/8 Test Suites O(1) Shuffle Time RNE Rounding

## What's Implemented

All core modules complete — 142 tests passing across 8 test suites:

| Module | Tests | Coverage |
|--------|-------|----------|
| DVM Primitives | 38 | CT-MATH-001 §3 test vectors |
| PRNG | 13 | Determinism, distribution quality |
| Shuffle | 19 | Bijection verification |
| Normalise | 13 | Correctness, overflow handling |
| Augment | 10 | Flip, crop, noise |
| Batch | 12 | Construction, verification |
| Merkle | 20 | Hashing, provenance chain |
| Bit-Identity | 17 | Cross-platform verification |

## Usage Example

```c
#include "ct_types.h"
#include "loader.h"
#include "normalize.h"
#include "shuffle.h"
#include "batch.h"
#include "merkle.h"

// Pre-allocated buffers
ct_sample_t dataset_samples[60000];
ct_dataset_t dataset = { .samples = dataset_samples, .num_samples = 60000 };
ct_fault_flags_t faults = {0};

// Load data (deterministic decimal parsing)
ct_load_csv("mnist.csv", &dataset, &faults);

// Setup normalisation
ct_normalize_ctx_t norm_ctx;
ct_normalize_init(&norm_ctx, means, inv_stds, 784);

// Create batch via deterministic shuffle
ct_batch_t batch;
ct_batch_init(&batch, batch_samples, batch_hashes, 32);
ct_batch_fill(&batch, &dataset, batch_index, epoch, seed);

// Verify integrity
int valid = ct_batch_verify(&batch);

// Initialize provenance chain
ct_provenance_t prov;
ct_provenance_init(&prov, dataset_hash, config_hash, seed);
```

## The Complete Pipeline

`certifiable-data` completes the deterministic ML pipeline:

```
certifiable-data  →  certifiable-training  →  certifiable-inference
       ↓                      ↓                        ↓
   Load data             Train model              Deploy model
   Normalise             Merkle chain             Bit-perfect
   Shuffle               Audit trail              inference
   Batch
```

Every step is deterministic. Every step is auditable. The same seed produces the same model produces the same predictions, forever.

## Why It Matters

### Reproducibility Crisis

The ML reproducibility crisis is well-documented. Papers can't be replicated. Models can't be reconstructed. Part of the problem is non-deterministic data pipelines — you can't reproduce training if you can't reproduce the exact data order.

### Certification Requirements

IEC 62304 Class C (medical devices) requires traceable software. DO-178C Level A (aerospace) requires complete requirements traceability. "We shuffled the data randomly" satisfies neither.

With `certifiable-data`, you can prove:

- Exactly what data was used
- In exactly what order
- With exactly what transformations
- And verify it cryptographically

### Debugging Training

When training fails or produces unexpected results, you need to understand what happened. With deterministic data loading, you can replay exact batches and inspect exact transformations. No "it worked differently yesterday" mysteries.

## Getting Started

```bash
git clone https://github.com/williamofai/certifiable-data
cd certifiable-data
mkdir build && cd build
cmake ..
make
make test
```

Expected output:

```
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 0.04 sec
```

## Documentation

- **CT-MATH-001.md** — Mathematical foundations (normalisation, Feistel, Merkle)
- **CT-STRUCT-001.md** — Data structure specifications
- **docs/requirements/** — SRS documents with full traceability

---

*Data loading as a pure function. Merkle-proven provenance.
GPL-3.0 licensed.* --- ## C-Sentinel **Repository**: https://github.com/williamofai/c-sentinel **License**: MIT **Documentation**: https://speytech.com/open-source/c-sentinel/ Semantic observability for UNIX systems — lightweight system probing with explainable risk scoring Most security monitoring tools are black boxes. They tell you something is wrong, but not why. They flag anomalies without explaining the reasoning. When the alert fires at 3 AM, you're left guessing. C-Sentinel takes a different approach: **semantic observability**. Every risk score comes with an explanation. Every alert tells you exactly which factors contributed and how much weight each carried. ## The Problem with Traditional Monitoring Traditional security monitoring operates on pattern matching. A process opens a network connection — is that suspicious? It depends on context that pattern matchers can't see. Consider a system administrator running `curl` to download a patch. Normal behaviour. Now consider the same command executed by a web server process. That's worth investigating. The difference isn't in the action itself but in **who** is performing it, **when**, and **why** it deviates from established baselines. This requires semantic understanding, not just pattern matching. ## How C-Sentinel Works C-Sentinel captures "system fingerprints" — snapshots of system state that include process trees, network connections, file system changes, and user activity. These fingerprints are then analysed for anomalies using explainable scoring. ### Explainable Risk Scoring Every risk score is decomposed into contributing factors: ``` Risk Score: 73/100 (High) Contributing Factors: ├── Process ancestry anomaly: +25 │ └── /usr/bin/curl spawned by httpd (unusual parent) ├── Network connection timing: +18 │ └── Connection established outside business hours ├── File access pattern: +15 │ └── Accessed /etc/shadow (sensitive file) └── User context: +15 └── Action performed by service account ``` You don't just see the score — you see the reasoning. This transforms alert investigation from guesswork into directed analysis. ### Auditd Integration C-Sentinel integrates with the Linux audit framework to provide process chain attribution. When an alert fires, you can trace the exact sequence of events: ``` Process Chain: systemd → httpd → /bin/sh → curl → [network connection] Timeline: 14:23:01 httpd receives request from 192.168.1.45 14:23:01 httpd spawns /bin/sh with args "-c curl ..." 14:23:02 curl connects to 203.0.113.50:443 14:23:02 curl writes to /tmp/.cache_xyz ``` This attribution chain is essential for incident response. You can see not just what happened, but how the system got into that state. ## Architecture C-Sentinel is written in pure C with zero external dependencies. The entire binary compiles to 104KB. 104KB Binary Size Zero Dependencies <1% CPU Overhead O(1) Memory per Probe This matters for deployment in constrained environments. No Python runtime. No Node.js. No container orchestration. Just a single binary that runs anywhere Linux runs. 
### Web Dashboard The dashboard provides real-time visibility with multi-user authentication and role-based access control: - **Two-factor authentication** with TOTP and QR code setup - **Role-based access** — separate permissions for viewers, analysts, and administrators - **Public demo mode** for showcasing without exposing sensitive data - **Email and Slack alerts** with rich formatting ### Security Posture Summary Beyond individual alerts, C-Sentinel generates plain-English summaries of overall system security posture: > "System posture: Moderate risk. Three service accounts have accessed sensitive files in the past 24 hours. Network connection patterns show 12% deviation from baseline, primarily due to increased API calls to external services. No privilege escalation attempts detected." This summary helps communicate security status to non-technical stakeholders without drowning them in technical detail. ## Design Philosophy C-Sentinel follows the same principles as our commercial safety-critical work: **Minimal footprint.** The smaller the codebase, the smaller the attack surface. Every line of code is a potential vulnerability. **Explainable behaviour.** Security tools that operate as black boxes are security liabilities. If you can't explain why an alert fired, you can't trust it. **No external dependencies.** Dependencies are inherited vulnerabilities. C-Sentinel's only dependency is the Linux kernel itself. **Deterministic operation.** Given the same system state, C-Sentinel produces the same risk score. This enables regression testing and makes behaviour predictable. ## Getting Started ```bash git clone https://github.com/williamofai/c-sentinel cd c-sentinel make ./c-sentinel --config /etc/c-sentinel/config.yaml ``` The default configuration provides sensible defaults for most Linux systems. Customisation is available for specific environments. ## Roadmap Current development focuses on: - **Solaris port** — returning to my Solaris 2.6 roots - **macOS support** — extending coverage to development workstations - **Container-aware probing** — understanding container boundaries in risk assessment - **Historical trend analysis** — detecting slow-moving attacks that evade point-in-time detection ## Why Open Source? Security through obscurity doesn't work. The only way to build trustworthy security tools is to make them auditable. By releasing C-Sentinel under the MIT license, anyone can verify that it does what it claims and nothing more. The open source also outlives us. When I'm gone, the code remains. Others can build on it, improve it, adapt it to their needs. That's the point. --- *Built in the Scottish Highlands. MIT licensed. Zero dependencies.* --- ## Certifiable Inference **Repository**: https://github.com/williamofai/certifiable-inference **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-inference/ Deterministic, bit-perfect neural network inference for safety-critical systems Most AI infrastructure is built for research, where "mostly reproducible" is good enough. In aerospace, medical devices, and autonomous vehicles, non-determinism isn't just a bug — it's a liability. Modern ML inference engines like TensorFlow Lite, ONNX Runtime, and PyTorch Mobile are fundamentally non-deterministic. Floating-point operations vary across platforms. Hash table iteration is unpredictable. Memory allocation affects behaviour. For safety-critical systems, you cannot prove correctness if behaviour varies. 
## The Three Pillars

`certifiable-inference` replaces the "black box" of modern ML with a deterministic pipeline:

### 1. Exact Math

Fixed-point arithmetic (Q16.16) ensures `Input A + Model B = Output C` across all platforms — x86, ARM, RISC-V — forever. No floating-point drift. No platform-dependent rounding.

```c
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_ONE   (1 << FIXED_SHIFT)
```

Q16.16 Fixed-Point Zero malloc() calls <5% P95 Jitter C99 Standard

## What's Implemented

The engine provides the core building blocks for convolutional neural networks:

- **Fixed-point arithmetic** — Q16.16 with deterministic rounding
- **Matrix operations** — Multiply, transpose, element-wise ops
- **2D Convolution** — Zero dynamic allocation, O(OH×OW×KH×KW)
- **Activation functions** — ReLU, deterministic thresholding
- **Max Pooling** — 2×2 stride-2 dimension reduction
- **Timing verification** — Proven <5% jitter at 95th percentile

## Usage Example

```c
#include "matrix.h"
#include "convolution.h"
#include "activation.h"
#include "pooling.h"

// Pre-allocated buffers (no malloc)
fixed_t input_buf[256];   // 16×16 input
fixed_t kernel_buf[9];    // 3×3 kernel
fixed_t conv_buf[196];    // 14×14 after conv
fixed_t pool_buf[49];     // 7×7 after pooling

// Initialize matrices
fx_matrix_t input, kernel, conv_out, pool_out;
fx_matrix_init(&input, input_buf, 16, 16);
fx_matrix_init(&kernel, kernel_buf, 3, 3);
fx_matrix_init(&conv_out, conv_buf, 14, 14);
fx_matrix_init(&pool_out, pool_buf, 7, 7);

// Inference pipeline
fx_conv2d(&input, &kernel, &conv_out);    // Convolution
fx_relu(&conv_out, &conv_out);            // Activation
fx_maxpool_2x2(&conv_out, &pool_out);     // Pooling

// Result: 7×7 feature map, bit-perfect across all platforms
```

## Why It Matters

### Medical Devices

A pacemaker must deliver a signal within a 10ms window. Non-deterministic timing causes life-threatening delays. With fixed-point arithmetic and bounded execution time, timing is provable.

### Autonomous Vehicles

ISO 26262 ASIL-D requires provable worst-case execution time (WCET). Floating-point variance makes this impossible — the same operation can take different amounts of time depending on input values. Fixed-point operations have constant timing.

### Aerospace

DO-178C Level A demands complete requirements traceability. "The model runs inference" is not certifiable. "The model executes exactly these operations in exactly this order with exactly this memory footprint" is certifiable.

## The Certifiable Pipeline

`certifiable-inference` is part of a complete deterministic ML pipeline:

| Project | Purpose | Status |
|---------|---------|--------|
| [certifiable-data](https://github.com/williamofai/certifiable-data) | Deterministic data loading and augmentation | ✅ Released |
| [certifiable-training](https://github.com/williamofai/certifiable-training) | Deterministic training with Merkle audit trail | ✅ Released |
| **certifiable-inference** | Deterministic inference | ✅ Released |

Together, these provide end-to-end determinism: from data loading through training to deployment. The same seed produces the same model produces the same predictions, forever.
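To make the exact-math pillar above a little more concrete, here is a minimal Q16.16 multiply sketch. The name `fx_mul` and the truncating round-down are illustrative assumptions; the engine's own primitives use deterministic rounding as described above.

```c
#include <stdint.h>
#include <stdio.h>

typedef int32_t fixed_t;            /* Q16.16: 16 integer bits, 16 fractional bits */
#define FIXED_SHIFT 16
#define FIXED_ONE   (1 << FIXED_SHIFT)

/* Illustrative Q16.16 multiply: widen to 64 bits, multiply, shift back down.
 * Integer-only, so every platform computes exactly the same bit pattern. */
static fixed_t fx_mul(fixed_t a, fixed_t b) {
    int64_t wide = (int64_t)a * (int64_t)b;   /* exact Q32.32 intermediate */
    return (fixed_t)(wide >> FIXED_SHIFT);    /* back to Q16.16 (truncating here) */
}

int main(void) {
    fixed_t half  = FIXED_ONE / 2;            /* 0.5 */
    fixed_t three = 3 * FIXED_ONE;            /* 3.0 */
    fixed_t r = fx_mul(half, three);          /* 1.5 */
    printf("0.5 * 3.0 -> raw 0x%08x (expected 0x00018000)\n", (unsigned)r);
    return 0;
}
```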
## Certification Support The implementation is designed to support certification under: - **DO-178C** — Aerospace software (Level A capable) - **IEC 62304** — Medical device software (Class C capable) - **ISO 26262** — Automotive functional safety (ASIL-D capable) - **IEC 61508** — Industrial safety systems (SIL-4 capable) Documentation includes Software Requirements Specifications (SRS), verification methods, and traceability matrices. ## Getting Started ```bash git clone https://github.com/williamofai/certifiable-inference cd certifiable-inference mkdir build && cd build cmake .. make make test ``` ## Roadmap - **Model loader** — ONNX import for pre-trained models - **Quantisation tools** — FP32 to Q16.16 conversion with error bounds - **Additional layers** — Batch normalisation, additional pooling modes ## Patent Protection This implementation is built on the Murray Deterministic Computing Platform (MDCP), protected by UK Patent GB2521625.0. Commercial licensing available for proprietary use in safety-critical systems. --- *Same input. Same output. Always. GPL-3.0 licensed.* --- ## Certifiable Training **Repository**: https://github.com/williamofai/certifiable-training **License**: GPL-3.0 **Documentation**: https://speytech.com/open-source/certifiable-training/ Deterministic ML training with Merkle audit trails — because 'we trained it' isn't certifiable Training machine learning models is inherently non-deterministic. Floating-point operations vary across platforms. Parallel reductions produce different results each run. Data shuffling depends on random number generators that aren't truly random. For safety-critical systems, you cannot certify what you cannot reproduce. `certifiable-training` redefines training as a **deterministic state evolution**. Training becomes a pure function: `θ_T = T^(T)(θ_0, D, seed)`. Given the same initial weights, data, and seed, you get the same final model — bit for bit, every time, on every platform. ## The Four Mechanisms ### 1. Fixed-Point Arithmetic Q16.16 for weights, Q8.24 for gradients, Q32.32 for accumulators. Same math, same result, every platform. | Format | Use Case | Range | Precision | |--------|----------|-------|-----------| | Q16.16 | Weights, activations | ±32768 | 1.5×10⁻⁵ | | Q8.24 | Gradients | ±128 | 5.9×10⁻⁸ | | Q32.32 | Accumulators | ±2³¹ | 2.3×10⁻¹⁰ | The higher precision for gradients prevents small updates from being rounded away. The wide accumulators prevent overflow during summation. ### 2. Deterministic Reduction Parallel gradient reduction is a major source of non-determinism. The order of floating-point additions affects the result. Different thread scheduling produces different sums. `certifiable-training` uses fixed tree topology with Neumaier compensated summation: ```c // Fixed reduction tree — same topology every time // [sum] // / \ // [a+b] [c+d] // / \ / \ // a b c d ``` The tree topology is determined at compile time, not runtime. Combined with compensated arithmetic, the reduction is deterministic regardless of hardware parallelism. ### 3. Reproducible "Randomness" Data shuffling and dropout require random numbers. But standard PRNGs maintain internal state that varies with execution order. We use counter-based PRNG: `PRNG(seed, op_id, step) → deterministic bits`. The output depends only on the inputs, not on hidden state. Same seed produces same sequence. 
For data shuffling, we use a Cycle-Walking Feistel network that provides true bijection for any dataset size: ``` π: [0, N-1] → [0, N-1] (one-to-one and onto) ``` This isn't "shuffle and hope" — it's a cryptographic permutation with proven properties. ### 4. Merkle Audit Trail Every training step produces a cryptographic commitment: ``` h_t = SHA256(h_{t-1} || H(θ_t) || H(B_t) || t) ``` Where: - `h_{t-1}` is the previous hash (chain link) - `H(θ_t)` is the hash of current weights - `H(B_t)` is the hash of the current batch - `t` is the step number Any step can be independently verified. If you claim "model X was produced by training Y," the Merkle chain proves it — or proves you're lying. 10/10 Test Suites SHA-256 Audit Chain Feistel Shuffle O(1) Verify Time ## What's Implemented All core modules complete — 10/10 test suites passing: | Module | Description | |--------|-------------| | DVM Primitives | Fixed-point arithmetic with fault detection | | Counter-based PRNG | Deterministic pseudo-random generation | | Compensated Summation | Neumaier algorithm for precision | | Reduction Tree | Fixed-topology parallel reduction | | Forward Pass | Q16.16 activations (ReLU, sigmoid, tanh) | | Backward Pass | Q8.24 gradient computation | | Optimizers | SGD, Momentum, Adam | | Merkle Chain | SHA256 audit trail with checkpoints | | Data Permutation | Cycle-Walking Feistel bijection | | Bit Identity | Cross-platform reproducibility tests | ## The Fault Model Every arithmetic operation can overflow, underflow, or divide by zero. Traditional code either ignores these (undefined behaviour) or throws exceptions (non-deterministic control flow). `certifiable-training` uses sticky fault flags: ```c typedef struct { uint32_t overflow : 1; // Saturated high uint32_t underflow : 1; // Saturated low uint32_t div_zero : 1; // Division by zero uint32_t domain : 1; // Invalid input uint32_t precision : 1; // Precision loss } ct_fault_flags_t; ``` Operations saturate rather than overflow. Faults are recorded but execution continues deterministically. If any fault occurs during a training step, the Merkle chain is invalidated — you know something went wrong, and you know exactly when. ## Usage Example ```c #include "ct_types.h" #include "forward.h" #include "backward.h" #include "optimizer.h" #include "merkle.h" // All buffers pre-allocated fixed_t weights[784 * 128]; grad_t gradients[784 * 128]; ct_fault_flags_t faults = {0}; // Initialize Merkle chain ct_merkle_ctx_t merkle; ct_merkle_init(&merkle, &weights_tensor, config, config_size, seed); // Training step ct_forward_linear(&layer, input, output, &faults); ct_backward_linear(&layer, grad_in, &faults); ct_sgd_step(&sgd, weights, gradients, size, &faults); // Commit to audit trail ct_merkle_step(&merkle, &weights_tensor, indices, batch_size, &step_record, &faults); if (ct_has_fault(&faults)) { // Training step invalid — chain not extended } ``` ## Why It Matters ### Regulatory Compliance IEC 62304 (medical devices) requires traceable, reproducible software. ISO 26262 (automotive) demands provable behaviour. DO-178C (aerospace) requires complete requirements traceability. "We trained a neural network and it works" satisfies none of these. "Here is the cryptographic proof that this model was produced by this exact training process" is a foundation for certification. ### Incident Investigation When an autonomous vehicle makes a bad decision, investigators need to understand why. 
With Merkle-chained training, you can prove exactly what training data and process produced the model. You can replay any training step and verify it matches the logged hash.

### Model Provenance

In an era of model theft and supply chain attacks, proving where a model came from matters. The Merkle chain provides cryptographic attestation of model lineage.

## The Certifiable Pipeline

`certifiable-training` is part of a complete deterministic ML pipeline:

| Project | Purpose |
|---------|---------|
| [certifiable-data](https://github.com/williamofai/certifiable-data) | Deterministic data loading, shuffling, augmentation |
| **certifiable-training** | Deterministic training with Merkle audit |
| [certifiable-inference](https://github.com/williamofai/certifiable-inference) | Deterministic inference |

The chain is complete: deterministic data → deterministic training → deterministic inference. End-to-end reproducibility.

## Getting Started

```bash
git clone https://github.com/williamofai/certifiable-training
cd certifiable-training
mkdir build && cd build
cmake ..
make
make test
```

Expected output:

```
100% tests passed, 0 tests failed out of 10
Total Test time (real) = 0.04 sec
```

## Documentation

- **CT-MATH-001.md** — Mathematical foundations
- **CT-STRUCT-001.md** — Data structure specifications
- **docs/requirements/** — SRS documents with full traceability

---

*Training as a pure function. Merkle-chained proof. GPL-3.0 licensed.*

---

# Technical Articles — Insights

Technical articles on deterministic systems, formal methods, and safety certification.

## Stochastic Rounding Without the Stochastic

**URL**: https://speytech.com/insights/stochastic-rounding-deterministic/
**Published**: January 20, 2026 22:00
**Topic**: Deterministic Computing

How PRNG-controlled rounding can provide regularisation benefits deterministically

Stochastic rounding is a technique for training neural networks at low precision. Instead of always rounding 2.7 to 3 and 2.3 to 2, you round probabilistically based on the fractional part. 2.7 rounds to 3 with 70% probability, to 2 with 30% probability.

The benefit: unbiased gradients. Over many operations, the expected value equals the true value. Models train better at low precision.

The problem: randomness. Different random seeds, different training runs. Non-reproducible results.

But here's the insight: the "random" numbers don't need to be random. They just need to be unpredictable *from the perspective of the computation*. A deterministic PRNG, seeded from the operation context, provides the same regularisation effect — with full reproducibility.

## The Mechanism

Traditional stochastic rounding:

```python
def stochastic_round(x):
    floor = int(x)
    frac = x - floor
    if random() < frac:
        return floor + 1
    return floor
```

The deterministic version replaces `random()` with a counter-based PRNG. In fixed-point form, rounding a wide intermediate down by `shift` fractional bits:

```c
int32_t ct_stochastic_round(int64_t x, uint32_t shift,
                            ct_prng_t *prng, ct_fault_flags_t *faults) {
    /* Truncate: drop the low `shift` fractional bits */
    int64_t truncated = x >> shift;
    int64_t frac = x & (((int64_t)1 << shift) - 1);

    /* Generate threshold from PRNG */
    uint32_t rand = ct_prng_next(prng);
    uint32_t threshold = rand >> (32 - shift);

    /* Compare fractional part to threshold */
    if ((uint64_t)frac > threshold) {
        truncated += (x >= 0) ? 1 : -1;
    }
    return dvm_clamp32(truncated, faults);
}
```

The key difference: `ct_prng_next(prng)` is deterministic. Given the same PRNG state, we get the same "random" number. The rounding decision is reproducible.
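A quick worked example, using the Q16.16 quantum (`shift = 16`) and the value `0x18000` that appears in the test vectors later in this article:

```
x          = 0x18000                     (1.5 quanta of the target format)
truncated  = x >> 16          = 1
frac       = x & 0xFFFF       = 0x8000   (32768)
threshold  = prng_output >> 16           (uniform over 0 … 65535)
round up  ⇔  frac > threshold  ⇔  threshold < 32768
P(round up) = 32768 / 65536 = 0.5   →   E[result] = 1 + 0.5 = 1.5
```

The expected result equals the true value of 1.5 quanta, yet each individual decision is fixed by the PRNG draw — the same (seed, op_id, step) always rounds the same way.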
## The PRNG Design The counter-based PRNG in [certifiable-training](https://github.com/williamofai/certifiable-training) is a pure function: ``` output = f(seed, operation_id, step) ``` Where: - **seed**: Fixed for the entire training run - **operation_id**: Unique per operation (layer × tensor × element) - **step**: Monotonically increasing counter ```c typedef struct { uint64_t seed; uint64_t op_id; uint64_t step; } ct_prng_t; uint32_t ct_prng_next(ct_prng_t *prng) { /* Combine state components */ uint64_t x = prng->seed; x ^= prng->op_id; x += prng->step; /* Mix with multiply-xorshift */ x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33; x *= 0xC4CEB9FE1A85EC53ULL; x ^= x >> 33; prng->step++; return (uint32_t)(x >> 32); } ``` The operation_id ensures different elements get different sequences. The step ensures repeated operations on the same element get different values. The seed allows reproducibility: same seed, same training run. ## Why This Works The magic of stochastic rounding isn't randomness per se — it's that the rounding errors are uncorrelated with the values being rounded. Consider gradient accumulation. With truncation (round toward zero), small gradients are systematically lost. A gradient of 0.0001 truncates to 0, every time. Information disappears. With stochastic rounding, that 0.0001 gradient occasionally rounds to 0.0001/quantum, preserving signal. Not every time — but often enough that the expected value is correct. The PRNG provides this uncorrelated behaviour. The sequence 0x24F74A49, 0xA96E3F40, 0xC1C8ECFB... has no relation to the gradients being rounded. Statistically, it behaves like true randomness. But unlike true randomness, we can replay it. Given the same (seed, op_id, step), we get the same sequence. Training is reproducible. ## The Tradeoff True stochastic rounding provides strong theoretical guarantees about unbiasedness. Our deterministic version provides *empirically equivalent* behaviour with reproducibility. The difference: with true randomness, you can prove statistical properties from first principles. With a PRNG, you're relying on the PRNG quality to not introduce systematic bias. Modern PRNGs pass all standard statistical tests. For practical training, they're indistinguishable from true randomness. But for formal analysis, you'd need to verify PRNG properties explicitly. ## Implementation Details ### Operation ID Construction Each weight update needs a unique operation_id: ```c uint64_t ct_prng_make_op_id(uint32_t layer_id, uint32_t tensor_id, uint32_t element_idx) { /* Combine with bit-shifting to avoid collisions */ return ((uint64_t)layer_id << 48) | ((uint64_t)tensor_id << 32) | (uint64_t)element_idx; } ``` Layer 3, tensor 1, element 42 → unique op_id. Layer 3, tensor 1, element 43 → different op_id. Different sequences, uncorrelated rounding decisions. ### Step Synchronisation The step counter must advance consistently across platforms: ```c /* Wrong: step based on loop iteration */ for (int i = 0; i < batch_size; i++) { ct_prng_init(&prng, seed, op_id); prng.step = global_step * batch_size + i; /* Platform-dependent ordering */ ... } /* Right: step derived from operation context */ for (int i = 0; i < batch_size; i++) { uint64_t step = global_step * max_batch_size + i; ct_prng_init(&prng, seed, op_id); prng.step = step; ... } ``` The max_batch_size constant ensures the same step values regardless of actual batch size. No platform-dependent ordering. 
### Fixed-Point Integration Stochastic rounding integrates with the Q16.16 arithmetic: ```c /* Multiply two Q16.16 values with stochastic rounding */ int32_t mul_sr(int32_t a, int32_t b, ct_prng_t *prng, ct_fault_flags_t *faults) { int64_t product = (int64_t)a * (int64_t)b; /* Product is Q32.32, need to round back to Q16.16 */ return ct_stochastic_round(product, 16, prng, faults); } ``` The 64-bit intermediate preserves precision. The stochastic round preserves expected value while reducing to 32 bits. ## When to Use Stochastic Rounding **Use it for:** - Gradient accumulation (small gradients matter) - Weight updates (preserve small changes) - Activation functions (reduce quantisation noise) **Don't use it for:** - Loss calculation (deterministic RNE preferred) - Inference (reproducibility more important than regularisation) - Checkpointing (exact values needed for resumption) The certifiable-* ecosystem uses stochastic rounding selectively — during training where regularisation helps, RNE elsewhere for strict determinism. ## Test Vectors For compliance, these test cases must pass: ```c ct_prng_t prng; ct_fault_flags_t faults = {0}; /* Seed 0, op_id 0: verify sequence */ ct_prng_init(&prng, 0, 0); assert(ct_prng_next(&prng) == 0x24F74A49); assert(ct_prng_next(&prng) == 0xA96E3F40); /* Same seed, same op_id: same sequence */ ct_prng_t prng2; ct_prng_init(&prng2, 0, 0); assert(ct_prng_next(&prng2) == 0x24F74A49); /* Different op_id: different sequence */ ct_prng_init(&prng, 0, 1); assert(ct_prng_next(&prng) != 0x24F74A49); /* Stochastic round determinism */ ct_prng_init(&prng, 12345, 500); ct_prng_t prng_copy = prng; int32_t r1 = ct_stochastic_round(0x18000LL, 16, &prng, &faults); int32_t r2 = ct_stochastic_round(0x18000LL, 16, &prng_copy, &faults); assert(r1 == r2); ``` ## The Broader Picture Stochastic rounding is one example of a broader principle: "random" behaviours in ML often don't need true randomness. They need unpredictability — which a good PRNG provides deterministically. Dropout? PRNG-controlled. Data augmentation? PRNG-controlled. Initialisation? PRNG-controlled. Seed everything from a single master seed, derive operation-specific sequences, and the entire training run becomes reproducible. Different seed → different run. Same seed → identical run. For safety-critical systems, this is more than convenient. It's a requirement. When you claim "this model was trained with these hyperparameters," you need to be able to prove it. Reproducibility is the foundation of that proof. ## Conclusion Stochastic rounding demonstrates that determinism and "randomness" aren't opposites. The regularisation benefits of stochastic methods come from uncorrelated errors, not from true randomness. A well-designed PRNG provides uncorrelated sequences. Seeded from operation context, it provides reproducibility. The training benefits of stochastic rounding, without the reproducibility costs. The certifiable-* ecosystem uses this approach throughout. Random-looking behaviour, fully deterministic implementation. Train on Monday, train on Friday, get identical results. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems where reproducibility is mandatory, PRNG-controlled stochastic rounding provides the best of both worlds. 
--- *Explore the PRNG implementation in [certifiable-training](https://github.com/williamofai/certifiable-training) or see the test vectors in the CT-MATH-001 specification.* --- ## Cross-Platform Bit-Identity: From Theory to 7 Matching Hashes **URL**: https://speytech.com/insights/cross-platform-bit-identity/ **Published**: January 20, 2026 20:45 **Topic**: Deterministic Computing The practical journey of verifying deterministic ML across platforms "It works on my machine." That phrase has ended more debugging sessions — and started more arguments — than any other in software engineering. For machine learning, it's worse: "The model trained on my machine." Does that mean it will train identically on a cloud server? On your colleague's laptop? On the production hardware? Usually, no. But for safety-critical systems, "usually" isn't good enough. We need "always." ## The Goal The [certifiable-*](https://github.com/williamofai/certifiable-harness) ecosystem makes a strong claim: run the same training pipeline with the same inputs, and you get the same outputs — bit-for-bit identical — regardless of platform. Not "close enough." Not "within floating-point tolerance." Identical. Here's what that looks like in practice: ```json { "platform": "x86_64-linux", "stages": [ {"name": "data", "hash": "2f0c6228001d125032afbe..."}, {"name": "training", "hash": "36b34d87459ead09c5349d..."}, {"name": "quant", "hash": "8c78bae645d6f06a3bdd6c..."}, {"name": "deploy", "hash": "32296bbc342c91ba0c95d1..."}, {"name": "inference", "hash": "48f4ecebc0eec79ab15fe6..."}, {"name": "monitor", "hash": "da7f49992d875a6390cb3c..."}, {"name": "verify", "hash": "33e41fcaaa25c405fbb44f..."} ], "bit_identical": true } ``` Same JSON from a Google Cloud Debian VM. Same JSON from an 11-year-old MacBook. The hashes match, byte for byte. ## What Makes This Hard Cross-platform determinism is simple in theory and subtle in practice. Here are the traps: ### Floating-Point Inconsistency IEEE-754 floating point has platform-dependent behaviour: - x87 FPU uses 80-bit extended precision internally - Fused multiply-add changes rounding sequences - Compiler optimisations reorder operations - `-ffast-math` abandons all guarantees Even "compliant" implementations can disagree at the last bit. **Solution**: Don't use floating point. Q16.16 fixed-point uses integer arithmetic only. `3 + 5 = 8` on every CPU ever made. ### PRNG Divergence Standard library random functions vary between platforms: - Different algorithms (LCG vs Mersenne Twister vs xorshift) - Different seeding behaviour - Different default states **Solution**: Implement your own PRNG. The certifiable-* ecosystem uses a counter-based PRNG that's a pure function of (seed, operation_id, step). No state, no platform dependencies. ### Memory Layout Struct padding, alignment, and byte order vary: - sizeof(int) differs between platforms - Compilers insert padding for alignment - Big-endian vs little-endian matters for serialisation **Solution**: Explicit serialisation with fixed byte order. All data structures use little-endian encoding with no padding. The canonical form is defined in the spec, not left to the compiler. ### Library Inconsistency Even deterministic algorithms can have non-deterministic implementations: - qsort() stability varies - Hash table iteration order varies - Floating-point math library functions vary **Solution**: Implement core algorithms from scratch. No standard library dependencies for anything that affects determinism. 
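For the explicit-serialisation point above, here is a minimal sketch of what canonical encoding looks like in practice. The helper names and the example fields are illustrative, not the certifiable-* wire format.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Write each field byte-by-byte in little-endian order, so the encoded form is
 * identical on every host — no struct padding, no compiler layout, no
 * endianness surprises. */
static size_t put_le32(uint8_t *out, uint32_t x) {
    for (int i = 0; i < 4; i++) out[i] = (uint8_t)(x >> (8 * i));
    return 4;
}

static size_t put_le64(uint8_t *out, uint64_t x) {
    for (int i = 0; i < 8; i++) out[i] = (uint8_t)(x >> (8 * i));
    return 8;
}

int main(void) {
    uint8_t buf[12];
    size_t off = 0;
    off += put_le64(buf + off, 0x0102030405060708ULL);  /* e.g. a step counter   */
    off += put_le32(buf + off, 0xAABBCCDDu);             /* e.g. a record version */

    for (size_t i = 0; i < off; i++) printf("%02x ", buf[i]);
    printf("\n");  /* 08 07 06 05 04 03 02 01 dd cc bb aa — on every platform */
    return 0;
}
```

Hashing this canonical byte stream, rather than the in-memory struct, is what makes the commitment hashes comparable across compilers and architectures.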
## The Verification Architecture

The [certifiable-harness](https://github.com/williamofai/certifiable-harness) runs all seven pipeline stages and produces a commitment hash for each:

1. **Data** — Load samples, shuffle with Feistel, hash the batch structure
2. **Training** — Forward pass, backward pass, weight updates, Merkle chain
3. **Quantisation** — FP32 to Q16.16 conversion with error certificates
4. **Deploy** — Bundle packaging with manifest and attestation
5. **Inference** — Forward pass on test data, prediction hashes
6. **Monitor** — Drift detection, operational envelope checks
7. **Verify** — Cross-stage binding verification

Each stage produces a deterministic hash. The harness collects them, compares against golden references, and reports pass/fail.

## The Test Matrix

We verified bit-identity across:

| Platform | Architecture | OS | Compiler |
|----------|-------------|-----|----------|
| Google Cloud | x86_64 | Debian 12 | GCC 12 |
| MacBook (2013) | x86_64 | macOS | Clang 15 |
| Raspberry Pi 4 | ARM64 | Ubuntu 22.04 | GCC 11 |
| RISC-V (SiFive) | RV64 | Buildroot | GCC 13 |

All platforms produce identical hashes. Not similar. Identical.

## What We Learned

### Compiler Flags Matter

Debug builds (`-O0`) and release builds (`-O3`) must produce the same results. Our code does. But we had to verify this — some "equivalent" optimisations change floating-point behaviour. Since we don't use floating point, this wasn't an issue.

### Integer Overflow Is Defined

In C, signed integer overflow is undefined behaviour. Compilers can assume it never happens and optimise accordingly. We use explicit overflow checks with saturation:

```c
int32_t dvm_add(int32_t a, int32_t b, ct_fault_flags_t *faults) {
    int64_t result = (int64_t)a + (int64_t)b;
    if (result > INT32_MAX) {
        faults->overflow = 1;
        return INT32_MAX;  /* Saturate */
    }
    if (result < INT32_MIN) {
        faults->underflow = 1;
        return INT32_MIN;  /* Saturate */
    }
    return (int32_t)result;
}
```

Well-defined behaviour on every platform.

### Endianness Must Be Explicit

We chose little-endian for all serialisation. ARM and x86 are native little-endian; RISC-V can be either but defaults to little-endian. For big-endian platforms (rare now), we'd need byte-swap routines.

```c
static inline uint32_t to_le32(uint32_t x) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return x;
#else
    return __builtin_bswap32(x);
#endif
}
```

### The SHA-256 Implementation Matters

We embed our own SHA-256, derived from a public domain implementation. Why not use OpenSSL? Because OpenSSL versions differ. The algorithm is identical, but linking against different library versions introduces a dependency we can't control.

Our SHA-256 produces the NIST test vectors:

- SHA256("") = `e3b0c44298fc1c14...`
- SHA256("abc") = `ba7816bf8f01cfea...`

If your implementation matches these, it'll match ours.

## The Practical Impact

### For Development

When a test fails, you know it's your code, not platform variance. Debugging is deterministic. Run the same inputs, get the same outputs, every time.

### For Certification

DO-178C, IEC 62304, and ISO 26262 all require evidence of reproducibility. "We ran it twice and got the same result" is weak evidence. "Here are the cryptographic hashes from three different platforms, all identical" is strong evidence.

### For Incident Investigation

When a deployed model misbehaves, you can reproduce the exact conditions. Not approximately. Exactly. The training data, the hyperparameters, the weight evolution — all reconstructible from the audit trail.
### For Collaboration Share a seed, share the results. Your colleague in another timezone, on different hardware, will get the same outputs. No more "it works on my machine" debates. ## The Cost Determinism isn't free: - **Development Effort**: Implementing from scratch rather than using libraries - **Performance**: Some optimisations aren't available (SIMD with platform-specific rounding) - **Constraints**: No floating point, no standard library random, explicit everything For many ML applications, this cost isn't justified. For safety-critical applications, it's the cost of doing business. ## Conclusion Cross-platform bit-identity transforms "we believe this is reproducible" into "we can prove this is reproducible." The proof is simple: run the pipeline on different platforms, compare the hashes. If they match, the claim is verified. The certifiable-* ecosystem achieves this through deliberate architectural choices: fixed-point arithmetic, embedded algorithms, explicit serialisation, comprehensive test vectors. None of these choices are exotic. They're just disciplined. Seven pipeline stages. Seven hashes. Identical across platforms. That's the standard for deterministic ML. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems where reproducibility must be proven, not assumed, cross-platform bit-identity is the foundation. --- *Run the verification yourself: clone [certifiable-harness](https://github.com/williamofai/certifiable-harness), build on your platform, and compare against the golden references.* --- ## The Feistel Shuffle: Deterministic Data Ordering Without Randomness **URL**: https://speytech.com/insights/feistel-shuffle-deterministic/ **Published**: January 20, 2026 20:24 **Topic**: Deterministic Computing How cycle-walking Feistel networks can provide reproducible shuffling for ML training Shuffling data is fundamental to ML training. Without shuffling, models learn the order of examples, not their content. Batch composition matters. Epoch ordering matters. But shuffling typically relies on random number generators. Different seeds, different shuffles. Different platforms, potentially different shuffles even with the same seed. For deterministic ML, that's a problem. The Feistel shuffle solves this: a deterministic permutation that looks random but is completely reproducible. ## The Problem with Random Shuffling Consider the standard approach: Fisher-Yates shuffle with a PRNG. ```python import random random.seed(42) random.shuffle(data) ``` This is "deterministic" in the sense that the same seed produces the same shuffle — on the same platform, with the same library version, using the same PRNG implementation. Change any of those, and guarantees disappear. Python's `random` module has changed algorithms between versions. NumPy's shuffle differs from Python's. C's `rand()` is notoriously platform-dependent. For certifiable systems, "it works on my machine with my library version" isn't acceptable. ## Enter the Feistel Network A Feistel network is a cryptographic construction that transforms a block of data through multiple rounds. It's the basis of DES, Blowfish, and other block ciphers. The key property: it's invertible. Every input maps to exactly one output, and you can reverse the process. Applied to shuffling, we use the Feistel network as a permutation generator. Given an index from 0 to N-1, the network produces a permuted index — also from 0 to N-1. Every index maps to a unique output. 
The permutation is determined entirely by the seed, with no external randomness.

## The Cycle-Walking Extension

There's a catch. Feistel networks operate on fixed-size blocks — typically powers of two. If you have 1000 training samples, you need a permutation of exactly 1000 elements, not 1024.

Cycle-walking handles this. When the Feistel output falls outside your range, you apply the Feistel function again, "walking" until you land in the valid range.

```c
uint32_t ct_feistel_permute(const ct_feistel_t *ctx, uint32_t x) {
    uint32_t result = x;
    /* Walk until we land in [0, n) */
    do {
        result = feistel_round(ctx, result);
    } while (result >= ctx->n);
    return result;
}
```

This always terminates (provably), and the output is always a valid permutation of [0, N).

## The Implementation

Here's how [certifiable-data](https://github.com/williamofai/certifiable-data) implements the Feistel permutation:

```c
typedef struct {
    uint32_t n;          /* Domain size */
    uint32_t half_bits;  /* Bits per half */
    uint32_t mask;       /* Mask for half */
    uint64_t keys[4];    /* Round keys */
} ct_feistel_t;

void ct_feistel_init(ct_feistel_t *ctx, uint32_t n, uint64_t seed) {
    ctx->n = n;

    /* Find smallest power of 2 >= n */
    uint32_t bits = 1;
    while ((1u << bits) < n) {
        bits++;
    }
    ctx->half_bits = (bits + 1) / 2;
    ctx->mask = (1u << ctx->half_bits) - 1;

    /* Derive round keys from seed */
    ctx->keys[0] = seed ^ 0x243F6A8885A308D3ULL;
    ctx->keys[1] = seed ^ 0x13198A2E03707344ULL;
    ctx->keys[2] = seed ^ 0xA4093822299F31D0ULL;
    ctx->keys[3] = seed ^ 0x082EFA98EC4E6C89ULL;
}
```

The round keys are derived from the seed using constants (first digits of pi, for those curious). Same seed → same keys → same permutation.

The core Feistel round:

```c
static uint32_t feistel_round(const ct_feistel_t *ctx, uint32_t x) {
    uint32_t left = x >> ctx->half_bits;
    uint32_t right = x & ctx->mask;

    for (int round = 0; round < 4; round++) {
        uint32_t f = round_function(right, ctx->keys[round], ctx->mask);
        uint32_t new_right = left ^ f;
        left = right;
        right = new_right;
    }
    return (left << ctx->half_bits) | right;
}

static uint32_t round_function(uint32_t x, uint64_t key, uint32_t mask) {
    uint64_t mixed = x * 0x9E3779B97F4A7C15ULL;
    mixed ^= key;
    mixed ^= mixed >> 32;
    return (uint32_t)(mixed & mask);
}
```

Four rounds is sufficient for good diffusion. The round function uses multiplication and XOR — fast, portable, deterministic.

## Why This Matters for ML

### Reproducible Epochs

Each epoch shuffles the data differently. With Feistel, you parameterise by epoch:

```c
ct_feistel_init(&shuffle, num_samples, seed + epoch);
```

Epoch 0 with seed 42 always produces the same permutation. Epoch 1 produces a different permutation, but equally deterministic. Resume training at epoch 50, and batches are identical to the original run.

### Batch Construction

Given the permutation, batch construction is trivial:

```c
for (uint32_t i = 0; i < batch_size; i++) {
    uint32_t src = ct_feistel_permute(&shuffle, batch_start + i);
    /* batch slot i takes dataset sample src */
}
```

### Invertibility

The permutation can also be run backwards — useful for asking "which original sample ended up at shuffled position k?". The inverse uses the same keys in reverse order with the left/right swap reversed. Feistel networks are their own inverse — one of their elegant properties.

## Comparison to Alternatives

### Fisher-Yates

- **Pro**: Simple, well-understood
- **Con**: Requires storing the full permutation or generating sequentially
- **Con**: PRNG-dependent

### Knuth Shuffle

Same as Fisher-Yates (they're equivalent algorithms).
### Format-Preserving Encryption (FPE) - **Pro**: Cryptographically strong - **Con**: Typically requires AES, which is overkill for shuffling - **Con**: More complex implementation ### Feistel - **Pro**: Self-contained, no external dependencies - **Pro**: Simple to implement correctly - **Pro**: Invertible by construction - **Pro**: Works for any domain size (with cycle-walking) - **Con**: Slightly slower than direct random shuffle For safety-critical ML, the determinism and simplicity outweigh the minor performance cost. ## Test Vectors For any Feistel implementation to be compliant, these test cases must pass: ```c /* Domain size 100, seed 0 */ ct_feistel_init(&ctx, 100, 0); assert(ct_feistel_permute(&ctx, 0) == 73); assert(ct_feistel_permute(&ctx, 1) == 24); assert(ct_feistel_permute(&ctx, 50) == 91); assert(ct_feistel_permute(&ctx, 99) == 8); /* Bijection check */ bool seen[100] = {false}; for (uint32_t i = 0; i < 100; i++) { uint32_t j = ct_feistel_permute(&ctx, i); assert(j < 100); assert(!seen[j]); seen[j] = true; } ``` If your implementation produces different values, it's not compatible. Cross-platform bit-identity requires exact agreement. ## Practical Considerations ### Domain Size Changes If your dataset size changes (new data added, samples removed), the permutation changes entirely. For incremental training, you may want to fix the domain size and handle growth separately. ### Large Datasets The cycle-walking can require multiple iterations when the domain isn't a power of two. Worst case for n = 2^k + 1 is roughly 2x the iterations. Average case is much better. For billion-element datasets, this is negligible. ### Parallelisation Each index lookup is independent. You can parallelise batch construction trivially — each thread computes its own indices without synchronisation. ## Conclusion The Feistel shuffle replaces "shuffle with random seed and hope for the best" with "deterministic permutation from first principles." Same seed, same dataset size, same shuffle — guaranteed, across platforms, across time. For ML training that must be reproducible, this isn't a minor detail. It's a foundational requirement. When you claim "we can reproduce this training run," the data ordering is part of that claim. The certifiable-* ecosystem uses Feistel shuffling throughout [certifiable-data](https://github.com/williamofai/certifiable-data). Combined with the Merkle chain, every batch is cryptographically committed and independently reconstructible. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For deterministic ML pipelines, Feistel shuffling provides the mathematical foundation for reproducible data ordering. --- *Explore the implementation in [certifiable-data](https://github.com/williamofai/certifiable-data) or see the permutation tests for the complete test vector suite.* --- ## Merkle Chains for ML Audit Trails **URL**: https://speytech.com/insights/merkle-chains-ml-audit/ **Published**: January 20, 2026 19:30 **Topic**: Deterministic Computing How cryptographic hash chains can make every training step verifiable "We trained the model on this data with these hyperparameters." That statement is easy to make and difficult to verify. In most ML pipelines, training is a black box. Data goes in, weights come out. If something goes wrong six months later, good luck reconstructing what actually happened. For safety-critical systems, "trust me" isn't sufficient. Regulators want evidence. 
Incident investigators want audit trails. Certification bodies want provable claims.

Merkle chains provide that evidence.

## The Core Idea

A Merkle chain is a sequence of cryptographic hashes where each hash depends on the previous one. Change any step in the sequence, and every subsequent hash changes. It's the same principle that secures blockchain transactions — applied to ML training.

For each training step, we compute:

```
h_t = SHA256(h_{t-1} || H(θ_t) || H(B_t) || t)
```

Where:

- `h_t` is the hash at step t
- `h_{t-1}` is the previous hash
- `H(θ_t)` is the hash of the weights after this step
- `H(B_t)` is the hash of the batch used in this step
- `t` is the step number

The chain is cryptographically bound. If someone claims "step 5,000 used batch X," you can verify it. If someone claims "the final weights came from this training run," you can trace the entire history.

## What Gets Committed

In [certifiable-training](https://github.com/williamofai/certifiable-training), every training step commits:

**Weight State (θ_t)**

The complete model weights after the update, serialised in canonical form. Canonical means: fixed byte order (little-endian), fixed layout, no padding ambiguity. The same weights always produce the same hash.

**Batch Composition (B_t)**

Which samples were used, in what order. For a batch of indices [42, 17, 891, 3], we hash the indices themselves. Combined with the deterministic Feistel shuffle, this lets you reconstruct exactly which training examples influenced each step.

**Step Number (t)**

Prevents replay attacks. You can't take step 5,000's data and claim it was step 3,000.

**Previous Hash (h_{t-1})**

The chain link. Each hash depends on all previous hashes, creating an immutable history.

## The Genesis Block

Every chain needs a starting point. The genesis hash commits the initial state:

```
h_0 = SHA256(H(θ_0) || H(config) || seed)
```

This captures:

- Initial weights (random initialisation or pre-trained)
- Training configuration (learning rate, batch size, epochs)
- Random seed (for deterministic reproduction)

From h_0, the entire training run is deterministically specified. Given the same genesis state, any compliant implementation will produce the same sequence of hashes.

## Implementation

Here's the core step function from certifiable-training:

```c
ct_error_t ct_merkle_step(ct_merkle_ctx_t *ctx,
                          const ct_tensor_t *weights,
                          const uint32_t *batch_indices,
                          uint32_t batch_size,
                          ct_training_step_t *step_out,
                          const ct_fault_flags_t *faults)
{
    /* Check for fault invalidation */
    if (ct_has_fault(faults)) {
        ctx->faulted = true;
        return CT_ERR_FAULT;
    }

    /* Hash current weights */
    uint8_t weights_hash[CT_HASH_SIZE];
    ct_tensor_hash(weights, weights_hash);

    /* Hash batch indices */
    uint8_t batch_hash[CT_HASH_SIZE];
    ct_sha256(batch_indices, batch_size * sizeof(uint32_t), batch_hash);

    /* Build preimage: prev_hash || weights_hash || batch_hash || step */
    uint8_t preimage[CT_HASH_SIZE * 3 + 8];
    memcpy(preimage, ctx->current_hash, CT_HASH_SIZE);
    memcpy(preimage + CT_HASH_SIZE, weights_hash, CT_HASH_SIZE);
    memcpy(preimage + CT_HASH_SIZE * 2, batch_hash, CT_HASH_SIZE);

    /* Encode step as little-endian 64-bit */
    uint64_t step = ctx->step;
    for (int i = 0; i < 8; i++) {
        preimage[CT_HASH_SIZE * 3 + i] = (uint8_t)(step >> (i * 8));
    }

    /* Compute step hash */
    ct_sha256(preimage, sizeof(preimage), ctx->current_hash);

    ctx->step++;
    return CT_OK;
}
```

The key property: this function is deterministic. Same inputs, same hash. No timestamps, no random nonces, no platform-dependent values.
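For context on how the step function is driven, here is a minimal sketch of a training loop that commits each update to the chain. The helper names (`select_batch`, `apply_update`) and the `MAX_BATCH_SIZE` bound are illustrative assumptions; only `ct_merkle_step` and its types come from the snippet above.

```c
/* Illustrative driver (helper names are assumptions, not the certifiable-training API) */
ct_error_t train_with_chain(ct_merkle_ctx_t *chain,
                            ct_tensor_t *weights,
                            uint32_t num_steps,
                            uint32_t batch_size)
{
    ct_fault_flags_t faults = {0};
    ct_training_step_t step_record;
    uint32_t batch_indices[MAX_BATCH_SIZE];   /* MAX_BATCH_SIZE: assumed compile-time bound */

    for (uint32_t s = 0; s < num_steps; s++) {
        /* Deterministic batch selection (e.g. the Feistel shuffle) */
        select_batch(s, batch_indices, batch_size);

        /* Weight update for this batch; numerical faults are flagged, not hidden */
        apply_update(weights, batch_indices, batch_size, &faults);

        /* Commit the new state: prev hash || weights hash || batch hash || step */
        ct_error_t err = ct_merkle_step(chain, weights, batch_indices, batch_size,
                                        &step_record, &faults);
        if (err != CT_OK) {
            return err;   /* chain invalidated (e.g. CT_ERR_FAULT) */
        }
    }
    return CT_OK;
}
```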
## Fault Invalidation What happens when something goes wrong during training? An overflow, a division by zero, a NaN that would have been? The chain records it. When a fault flag is set, the Merkle context is marked as "faulted." Subsequent steps can continue (for debugging), but the chain is cryptographically invalidated. This prevents a subtle attack: "the training had some numerical issues, but we fixed them and continued." With fault invalidation, you can't hide problems. The chain either represents a clean run or it doesn't. ## Verification Without Replay The Merkle chain enables two levels of verification: **Hash-Only Verification (Fast)** Given the chain hashes and the claimed final weights, verify that the chain is internally consistent. This doesn't prove the training was correct — it proves the claimed history wasn't tampered with. **Full Replay Verification (Slow)** Re-run the entire training from genesis, comparing hashes at each step. If every hash matches, the training is bit-identical to the claimed history. This is expensive but provides the strongest guarantee. [certifiable-verify](https://github.com/williamofai/certifiable-verify) implements both modes. For most audits, hash-only verification is sufficient. For certification or incident investigation, full replay provides cryptographic proof. ## The Checkpoint Problem Training runs can take days. If the machine crashes at step 50,000, you don't want to restart from zero. Checkpoints break the simple chain model — you're resuming from a saved state, not computing continuously. The solution: checkpoints commit to the chain state at the save point. ```c typedef struct { uint64_t step; uint32_t epoch; uint8_t merkle_hash[CT_HASH_SIZE]; uint8_t weights_hash[CT_HASH_SIZE]; uint8_t config_hash[CT_HASH_SIZE]; ct_prng_t prng_state; uint64_t timestamp; /* EXCLUDED from commitment */ uint32_t version; ct_fault_flags_t fault_flags; } ct_checkpoint_t; ``` Note that `timestamp` is excluded from the cryptographic commitment. Timestamps are useful for humans but shouldn't affect determinism — a checkpoint saved at 3pm should be identical to one saved at 3am. ## What This Enables **Incident Investigation** When a deployed model misbehaves, you can trace back: which training run produced this model? What data was used? Were there any numerical faults? **Regulatory Compliance** For DO-178C (aerospace), IEC 62304 (medical devices), ISO 26262 (automotive), auditors want evidence that the development process was controlled. A Merkle chain is that evidence — cryptographically signed, tamper-evident, independently verifiable. **Reproducibility Claims** "Our results are reproducible" is a strong claim. With a Merkle chain, it's a provable claim. Anyone with the genesis state and the chain hashes can verify that your training run is reproducible. **Model Provenance** In a world of fine-tuned models and transfer learning, "where did this model come from?" is increasingly important. The chain provides complete provenance: initial weights → training data → final weights. ## The Cost Merkle chains aren't free: - **Storage**: Each step adds ~100 bytes (hashes + step record) - **Compute**: SHA-256 of weights at each step (can be expensive for large models) - **Complexity**: More code, more tests, more things that can go wrong For a 10,000-step training run, you're looking at ~1MB of chain data. For a million-step run, ~100MB. That's negligible compared to the model weights themselves. The hash computation is more significant. 
For a 1-million-parameter model (4MB in Q16.16), hashing takes roughly 10ms. For a 1-billion-parameter model, it's 10 seconds per step. That's meaningful overhead. The mitigation: hash only at checkpoint boundaries for large models, with the option for full-step hashing when audit requirements demand it. ## Conclusion Merkle chains transform "we trained this model" from a claim into evidence. Every step is committed. Every batch is recorded. Every weight update is cryptographically bound to its history. For systems that matter — medical devices, autonomous vehicles, aerospace — this audit trail isn't overhead. It's a requirement. When something goes wrong, you need to know what happened. When regulators ask questions, you need provable answers. The certifiable-* ecosystem builds this in from the start. Not as an afterthought, not as a logging feature, but as a fundamental architectural property. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems that must be auditable, Merkle chains provide the foundation. --- *Explore the implementation in [certifiable-training](https://github.com/williamofai/certifiable-training) or see [certifiable-verify](https://github.com/williamofai/certifiable-verify) for the verification tooling.* --- ## Round-to-Nearest-Even: The Rounding Mode That Makes Determinism Possible **URL**: https://speytech.com/insights/round-to-nearest-even/ **Published**: January 20, 2026 19:00 **Topic**: Deterministic Computing Why banker's rounding matters for bit-identical machine learning When you round 2.5 to an integer, what do you get? If you said 3, you're thinking like most programmers. If you said 2, you're thinking like someone who needs deterministic systems. Both answers are mathematically valid — and that's exactly the problem. ## The Halfway Problem Most rounding methods agree on the easy cases: 2.3 rounds to 2, 2.7 rounds to 3. The disagreement happens at the exact midpoint: 2.5. The "round half up" rule (taught in schools) says 2.5 → 3. Simple, consistent, and subtly biased. Over millions of operations, this bias accumulates. Values drift upward. In safety-critical systems, drift is the enemy. Round-to-Nearest-Even (RNE), also called banker's rounding, takes a different approach: when the value is exactly halfway, round to the nearest *even* number. - 1.5 → 2 (rounds up to even) - 2.5 → 2 (rounds down to even) - 3.5 → 4 (rounds up to even) - 4.5 → 4 (rounds down to even) The bias cancels out. Half the time you round up, half the time down. Over millions of operations, the statistical error approaches zero. ## Why This Matters for Machine Learning Neural network training involves billions of arithmetic operations. Each multiplication, each accumulation, each gradient update requires a rounding decision when you're working in fixed-point arithmetic. Consider a single training step with 1 million weight updates. If each update has even a tiny systematic bias, you're introducing 1 million small errors — all in the same direction. After 10,000 training steps, that's 10 billion biased operations. With RNE, the errors are unbiased. They still exist (quantisation error is unavoidable), but they don't accumulate in one direction. The trained model converges to the same place regardless of whether you're running on x86, ARM, or RISC-V. 
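The cancellation is easy to see numerically. Here is a small, self-contained sketch (not from the certifiable-* code) that feeds one million exact halfway cases through round-half-up and round-to-nearest-even and accumulates the signed error:

```c
#include <stdio.h>
#include <stdint.h>

/* Round x/2 to an integer, where x is odd, so x/2 is always an exact tie. */
static int64_t round_half_up(int64_t x)   { return (x + 1) / 2; }
static int64_t round_half_even(int64_t x) {
    int64_t t = x / 2;                    /* truncate toward zero */
    return (t & 1) ? t + 1 : t;           /* halfway: bump only if the result would be odd */
}

int main(void) {
    int64_t bias_up = 0, bias_rne = 0;
    for (int64_t k = 0; k < 1000000; k++) {
        int64_t x = 2 * k + 1;            /* x/2 = k + 0.5 */
        bias_up  += 2 * round_half_up(x)   - x;   /* signed error, measured in halves */
        bias_rne += 2 * round_half_even(x) - x;
    }
    printf("round-half-up accumulated error: %lld halves\n", (long long)bias_up);
    printf("round-to-even accumulated error: %lld halves\n", (long long)bias_rne);
    return 0;
}
```

Under round-half-up the error grows linearly with the number of operations; under RNE the up and down cases cancel and the total stays at zero.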
## The Implementation

Here's how [certifiable-training](https://github.com/williamofai/certifiable-training) implements RNE in pure C:

```c
int32_t dvm_round_shift_rne(int64_t x, uint32_t shift, ct_fault_flags_t *faults)
{
    if (shift > 62) {
        faults->domain = 1;
        return 0;
    }
    if (shift == 0) {
        return dvm_clamp32(x, faults);
    }

    int64_t half = 1LL << (shift - 1);
    int64_t mask = (1LL << shift) - 1;
    int64_t frac = x & mask;

    /* Truncate toward zero so the sign-specific halfway logic below is symmetric */
    int64_t truncated = (x >= 0) ? (x >> shift) : -((-x) >> shift);

    if (x >= 0) {
        if (frac > half) {
            truncated += 1;
        } else if (frac == half) {
            /* Exactly halfway: round to even */
            if (truncated & 1) {
                truncated += 1;
            }
        }
    } else {
        int64_t abs_frac = (-x) & mask;
        if (abs_frac > half) {
            truncated -= 1;
        } else if (abs_frac == half) {
            /* Exactly halfway: round to even (toward zero if already even) */
            if (truncated & 1) {
                truncated -= 1;
            }
        }
    }

    return dvm_clamp32(truncated, faults);
}
```

The key insight is the `truncated & 1` check. If the integer part is already even, we leave it alone. If it's odd, we bump it to the nearest even value.

## Test Vectors

For any RNE implementation to be certifiable, it must pass these exact test vectors (from CT-MATH-001 §8):

| Input (Q16.16) | Shift | Expected | Reasoning |
|----------------|-------|----------|-----------|
| 0x00018000 (1.5) | 16 | 2 | Halfway, 2 is even |
| 0x00028000 (2.5) | 16 | 2 | Halfway, 2 is even |
| 0x00038000 (3.5) | 16 | 4 | Halfway, 4 is even |
| 0x00048000 (4.5) | 16 | 4 | Halfway, 4 is even |
| 0x00058000 (5.5) | 16 | 6 | Halfway, 6 is even |
| -0x18000 (-1.5) | 16 | -2 | Halfway, -2 is even |
| -0x28000 (-2.5) | 16 | -2 | Halfway, -2 is even |
| -0x38000 (-3.5) | 16 | -4 | Halfway, -4 is even |

If your implementation produces different results for any of these inputs, it's not RNE-compliant — and it won't produce bit-identical results with other compliant implementations.

## The Practical Impact

We've verified bit-identity across platforms using these test vectors. The [certifiable-harness](https://github.com/williamofai/certifiable-harness) runs the full pipeline — data loading, training, quantisation, inference — and produces identical hashes on:

- Google Cloud Debian VM (x86_64)
- 11-year-old MacBook (x86_64)
- RISC-V validation (in progress)

The same seed, the same data, the same hyperparameters → the same trained model, bit-for-bit. Not "close enough." Not "within tolerance." Identical.

## Why Not Just Use Floating Point?

IEEE-754 floating point actually mandates RNE as the default rounding mode. So why not use floats?

Because "default" doesn't mean "guaranteed." Different compilers, different optimisation levels, different FPU implementations can produce different results. The x87 FPU uses 80-bit extended precision internally. Fused multiply-add operations change the rounding sequence. `-ffast-math` throws all guarantees out the window.

Fixed-point with explicit RNE removes the ambiguity. Every operation is defined. Every intermediate result is specified. There's no hidden precision, no compiler freedom, no platform variance.

## The Trade-off

RNE adds complexity. The halfway check requires extra logic. The negative number handling is subtle (and easy to get wrong). It's slower than simple truncation.

For systems where "close enough" is acceptable, this overhead isn't justified. For systems where reproducibility is mandatory — aerospace, medical devices, autonomous vehicles — the overhead is negligible compared to the cost of non-determinism.

## Conclusion

Round-to-Nearest-Even is a small detail with large consequences. It's the difference between "our model training is reproducible" and "our model training is *provably* reproducible."
The certifiable-* ecosystem uses RNE throughout: in matrix multiplication, in gradient computation, in activation functions, in loss calculation. Every rounding decision follows the same rule, producing the same result, on every platform. For safety-critical machine learning, that consistency isn't a nice-to-have. It's a requirement. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems that must be certifiable, RNE is the foundation. --- *The certifiable-* ecosystem is open source. Explore the [implementation](https://github.com/williamofai/certifiable-training) or read the [CT-MATH-001 specification](https://github.com/williamofai/certifiable-training/blob/main/docs/CT-MATH-001.md) for the complete mathematical foundation.* --- ## Fixed-Point Neural Networks: The Math Behind Q16.16 **URL**: https://speytech.com/insights/fixed-point-neural-networks/ **Published**: January 15, 2026 21:30 **Topic**: Safety-Critical AI How integer arithmetic can enable deterministic AI inference for safety-critical systems Neural networks run on floating-point arithmetic. This is so universal that most practitioners never question it. But floating-point has a property that matters enormously in safety-critical systems: it is not deterministic across platforms. The same floating-point calculation can produce different results on x86 versus ARM, on GPU versus CPU, or even between compiler versions. For research and most production systems, these differences are negligible. For systems that require certification under DO-178C, IEC 62304, or ISO 26262, they represent a fundamental barrier. Fixed-point arithmetic offers an alternative. By representing fractional values as scaled integers, fixed-point achieves bit-identical results across all platforms. The trade-off is reduced range and precision compared to floating-point. Understanding this trade-off is essential for evaluating whether fixed-point is appropriate for a given application. This article explains the mathematics of Q16.16 fixed-point representation, demonstrates the core operations required for neural network inference, and examines the precision implications for safety-critical AI systems. ## Why Floating-Point Varies Floating-point arithmetic follows the IEEE 754 standard, which specifies representation and basic operations. However, the standard permits variation in several areas that affect reproducibility. **Intermediate precision.** The x87 floating-point unit in x86 processors uses 80-bit extended precision for intermediate calculations, even when the source and destination are 32-bit floats. ARM processors typically use 32-bit precision throughout. The same sequence of operations can accumulate different rounding errors. **Fused multiply-add.** Modern processors offer FMA instructions that compute `a * b + c` with a single rounding step instead of two. Whether the compiler uses FMA depends on optimisation settings, target architecture, and sometimes the order of operations in source code. FMA produces more accurate results, but different results from separate multiply and add. **Associativity.** Floating-point addition is not associative: `(a + b) + c` may differ from `a + (b + c)`. Compilers that reorder operations for performance can change results. Parallel reduction algorithms that sum values in different orders produce different results. **Transcendental functions.** Functions like `exp()`, `log()`, and `sin()` are not specified bit-exactly by IEEE 754. 
Different math libraries use different approximation algorithms.

For neural network inference, these variations typically produce outputs that differ in the least significant bits. The classification result is usually the same. But "usually the same" is insufficient for systems where certification requires demonstrable reproducibility.

## The Fixed-Point Alternative

Fixed-point arithmetic represents fractional values as integers with an implicit scaling factor. The Q16.16 format uses a 32-bit signed integer where:

- The upper 16 bits represent the integer part
- The lower 16 bits represent the fractional part
- The scaling factor is 2^16 = 65536

A value like 3.25 is stored as `3.25 × 65536 = 212992`. The key insight is that all arithmetic operates on integers, using the standard integer ALU that behaves identically on every platform.

**Design Property: Platform Independence**

Integer arithmetic produces identical results on all platforms. Fixed-point inherits this property, enabling bit-perfect reproducibility across x86, ARM, RISC-V, and any architecture with standard integer operations.

### Range and Precision

Q16.16 provides:

| Property | Value |
|----------|-------|
| Minimum value | -32768.0 |
| Maximum value | +32767.99998... |
| Resolution | 1/65536 ≈ 0.0000153 |
| Decimal precision | ~4.8 significant digits |

Compare this to 32-bit floating-point:

| Property | Float32 | Q16.16 |
|----------|---------|--------|
| Range | ±3.4 × 10^38 | ±32768 |
| Precision | ~7 significant digits | ~4.8 significant digits |
| Deterministic | No | Yes |

The reduced range and precision are significant constraints. Neural network weights and activations must be scaled to fit within ±32768, and accumulated rounding errors must be managed carefully. The benefit is certainty: the same input always produces the same output.

## Core Operations

Neural network inference requires four fundamental operations: conversion, addition, multiplication, and accumulation. Each has specific implementation requirements in fixed-point.

### Conversion

Converting from floating-point to Q16.16 during model quantisation:

```c
typedef int32_t fixed_t;

#define FX_FRAC_BITS 16
#define FX_SCALE (1 << FX_FRAC_BITS)

/* Float to Q16.16, rounding half away from zero */
fixed_t fx_from_float(float value) {
    float scaled = value * FX_SCALE;
    return (fixed_t)(scaled >= 0 ? scaled + 0.5f : scaled - 0.5f);
}
```

### Addition and Subtraction

Fixed-point values with the same Q format add directly:

```c
fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }
fixed_t fx_sub(fixed_t a, fixed_t b) { return a - b; }
```

Overflow is the primary concern. Adding two Q16.16 values near the maximum can overflow the 32-bit representation. For safety-critical systems, saturation arithmetic prevents wrap-around:

```c
#define FX_MAX ((fixed_t)0x7FFFFFFF)
#define FX_MIN ((fixed_t)0x80000000)

fixed_t fx_add_sat(fixed_t a, fixed_t b) {
    int64_t result = (int64_t)a + (int64_t)b;
    if (result > FX_MAX) return FX_MAX;
    if (result < FX_MIN) return FX_MIN;
    return (fixed_t)result;
}
```

### Multiplication

Multiplying two Q16.16 values produces a 64-bit product in Q32.32 format; shifting right by 16 bits returns it to Q16.16:

```c
fixed_t fx_mul(fixed_t a, fixed_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    return (fixed_t)(product >> FX_FRAC_BITS);
}
```

This operation truncates the lower 16 bits of the 64-bit product. For applications requiring rounding:

```c
fixed_t fx_mul_rounded(fixed_t a, fixed_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    // Add 0.5 in the fractional position before shifting
    product += (1 << (FX_FRAC_BITS - 1));
    return (fixed_t)(product >> FX_FRAC_BITS);
}
```

### Division

Division is less common in inference (weights are pre-computed), but when needed:

```c
fixed_t fx_div(fixed_t a, fixed_t b) {
    /* Pre-shift the dividend so the quotient lands back in Q16.16 */
    int64_t dividend = (int64_t)a << FX_FRAC_BITS;
    return (fixed_t)(dividend / b);
}
```

### Accumulation

Dot products and matrix multiplications accumulate many products, so the accumulator must be 64-bit, with the shift back to Q16.16 applied once at the end:

```c
void fx_matmul(const fixed_t* a, const fixed_t* b, fixed_t* out,
               int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;
            for (int k = 0; k < K; k++) {
                acc += (int64_t)a[i * K + k] * (int64_t)b[k * N + j];
            }
            out[i * N + j] = (fixed_t)(acc >> FX_FRAC_BITS);
        }
    }
}
```

The 64-bit accumulator is critical. A matrix multiplication with K=1000 elements could overflow a 32-bit accumulator even if individual products fit.
The shift happens once at the end, preserving precision during accumulation.

### Convolution

2D convolution for image processing:

```c
void fx_conv2d(const fixed_t* input, const fixed_t* kernel, fixed_t* output,
               int in_h, int in_w, int k_h, int k_w) {
    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;

    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int64_t acc = 0;
            for (int kh = 0; kh < k_h; kh++) {
                for (int kw = 0; kw < k_w; kw++) {
                    acc += (int64_t)input[(oh + kh) * in_w + (ow + kw)]
                         * (int64_t)kernel[kh * k_w + kw];
                }
            }
            output[oh * out_w + ow] = (fixed_t)(acc >> FX_FRAC_BITS);
        }
    }
}
```

### Activation Functions

ReLU is trivial in fixed-point:

```c
fixed_t fx_relu(fixed_t x) {
    return x > 0 ? x : 0;
}
```

More complex activations like sigmoid or tanh require lookup tables or polynomial approximations:

```c
// Sigmoid approximation using piecewise linear segments.
// Segment breakpoints and slopes shown here are representative values.
fixed_t fx_sigmoid_approx(fixed_t x) {
    // Saturate for large inputs
    if (x >= FX_FROM_INT(4))  return FX_FROM_FLOAT(0.9820f);
    if (x <= FX_FROM_INT(-4)) return FX_FROM_FLOAT(0.0180f);

    // Central segment: sigmoid(x) is close to 0.5 + x/4 for -1 <= x <= 1
    if (x >= FX_FROM_INT(-1) && x <= FX_FROM_INT(1)) {
        return FX_FROM_FLOAT(0.5f) + (x >> 2);
    }

    // Outer segments: interpolate between the central region and saturation
    if (x > 0) {
        return FX_FROM_FLOAT(0.7311f) + fx_mul(x - FX_FROM_INT(1), FX_FROM_FLOAT(0.0836f));
    }
    return FX_FROM_FLOAT(0.2689f) + fx_mul(x + FX_FROM_INT(1), FX_FROM_FLOAT(0.0836f));
}
```

### Max Pooling

Pooling needs only comparisons, which integer arithmetic performs exactly. A 2×2, stride-2 max pool:

```c
void fx_maxpool_2x2(const fixed_t* input, fixed_t* output, int in_h, int in_w) {
    int out_h = in_h / 2;
    int out_w = in_w / 2;

    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int ih = oh * 2;
            int iw = ow * 2;
            fixed_t max_val = input[ih * in_w + iw];
            if (input[ih * in_w + iw + 1] > max_val) max_val = input[ih * in_w + iw + 1];
            if (input[(ih + 1) * in_w + iw] > max_val) max_val = input[(ih + 1) * in_w + iw];
            if (input[(ih + 1) * in_w + iw + 1] > max_val) max_val = input[(ih + 1) * in_w + iw + 1];
            output[oh * out_w + ow] = max_val;
        }
    }
}
```

## Precision Analysis

The practical question is: does reduced precision affect model accuracy?

### Quantisation Error

Converting a trained floating-point model to Q16.16 introduces quantisation error in every weight. For a weight w, the quantised value is:

```
w_q = round(w × 65536) / 65536
```

The maximum error per weight is ±0.5/65536 ≈ ±7.6 × 10^-6.

For a layer with N weights, these errors accumulate. The total error depends on the statistical distribution of weights and the structure of the computation. Empirically, Q16.16 typically maintains classification accuracy within 1-2% of the original floating-point model for common architectures. However, this varies significantly by model and task. Rigorous evaluation on the target application is essential.

### Accumulation Error

Each multiplication introduces rounding error when the 64-bit product is shifted back to 32 bits. For a dot product of length K:

- K multiplications, each with error up to 1 LSB
- One final shift with error up to 1 LSB
- Total error bounded by K + 1 LSBs

For large K (thousands of weights per neuron), this can become significant. Techniques to manage accumulation error include:

1. **Larger accumulators**: Use 64-bit arithmetic throughout
2. **Block accumulation**: Sum in smaller blocks, normalise between blocks
3. **Kahan summation**: Track and compensate for rounding errors (increases complexity)

### Overflow Prevention

The Q16.16 range of ±32768 constrains both weights and activations. Model quantisation must ensure:

- All weights fit within the representable range
- Activations cannot exceed the range during inference
- Accumulated values in matrix operations fit in 64-bit intermediates

This typically requires scaling the model. A common approach:

1. Analyse the trained model to find weight and activation ranges
2. Apply per-layer scaling factors to fit within Q16.16
3. Adjust bias terms to compensate for scaling
4. Validate accuracy after quantisation

## Practical Considerations

### Memory Layout

Fixed-point values are 32-bit integers. Memory requirements match float32 exactly. There is no memory advantage to Q16.16 over float32, unlike INT8 quantisation.

The advantage is computational: fixed-point operations use the integer ALU, which on some embedded processors is faster and more power-efficient than the FPU. More importantly, integer operations are deterministic.
### No Dynamic Allocation

Safety-critical systems typically prohibit dynamic memory allocation after initialisation, as discussed in [The Real Cost of Dynamic Memory in Safety-Critical Systems](/insights/dynamic-memory-safety-critical/). The fixed-point implementations shown here use caller-provided buffers:

```c
// Caller allocates all buffers
fixed_t input[256];
fixed_t kernel[9];
fixed_t output[196];

// Function uses only provided memory
fx_conv2d(input, kernel, output, 16, 16, 3, 3);
```

This enables static memory analysis and prevents heap fragmentation, both requirements for DO-178C Level A certification.

### Testing Determinism

Verifying bit-exact reproducibility requires testing across platforms:

```c
void test_determinism(void) {
    // Known input
    fixed_t a = 0x00028000;  // 2.5
    fixed_t b = 0x00018000;  // 1.5

    // Expected output (computed once, verified manually)
    fixed_t expected = 0x0003C000;  // 3.75

    fixed_t result = fx_mul(a, b);
    assert(result == expected);  // Exact equality, not approximate
}
```

These tests must pass on every target platform. Any difference indicates a determinism failure that must be investigated.

## Trade-offs and Limitations

Fixed-point Q16.16 is not universally appropriate. Consider these factors:

**Fixed-Point Strengths**

- Bit-exact reproducibility across platforms
- No floating-point unit required
- Predictable execution time (no denormals)
- Supports static memory analysis
- Integer ALU may be faster on embedded systems

**Fixed-Point Limitations**

- Reduced dynamic range (±32768 vs ±10^38)
- Reduced precision (~4.8 vs ~7 digits)
- Requires careful overflow management
- Model must be quantised and validated
- Some activations need approximation

For safety-critical systems requiring certification, the determinism guarantee often outweighs the precision limitations. For research or high-precision applications, floating-point remains the appropriate choice.

## Implementation Reference

The [certifiable-inference](https://github.com/williamofai/certifiable-inference) project provides a complete implementation of fixed-point neural network inference in C99, including:

- Q16.16 arithmetic with saturation
- Matrix operations
- 2D convolution
- Activation functions (ReLU, approximated sigmoid)
- Max pooling

The implementation passes determinism tests across x86, ARM, and RISC-V platforms. A [live simulator](https://inference.speytech.com/) demonstrates the inference pipeline.

## Conclusion

Fixed-point arithmetic trades precision for determinism. The Q16.16 format provides sufficient precision for many neural network applications while guaranteeing bit-identical results across platforms.

The mathematics are straightforward: scale values by 65536, use integer arithmetic, and shift results after multiplication. The engineering challenge lies in managing overflow, quantising models appropriately, and validating that reduced precision does not unacceptably degrade accuracy.

For systems requiring certification under DO-178C, IEC 62304, or ISO 26262, fixed-point may provide a path that floating-point cannot. The ability to prove that identical inputs produce identical outputs simplifies verification and validation significantly.

As with any architectural approach, suitability depends on system requirements, precision constraints, and regulatory context. Fixed-point is not a universal solution, but for safety-critical AI where determinism matters more than dynamic range, it offers a mathematically sound foundation.
---

*For a working implementation of these principles, see [certifiable-inference](https://github.com/williamofai/certifiable-inference) or try the [live simulator](https://inference.speytech.com/).*

---

## Bit-Perfect Reproducibility: Why It Matters and How to Prove It

**URL**: https://speytech.com/insights/bit-perfect-reproducibility/
**Published**: January 15, 2026 20:15
**Topic**: Safety-Critical AI

What deterministic execution actually means and how to verify it across platforms

"Reproducible" means different things in different contexts. In research, it often means "similar results within experimental tolerance." In most production systems, it means "the same classification most of the time." In safety-critical systems, it can mean something far more stringent: identical output, byte for byte, on every execution.

This article examines what bit-perfect reproducibility actually requires, why it matters for certification, and how to design systems that achieve it. The techniques apply broadly, but the examples focus on neural network inference where reproducibility failures are particularly common and consequential.

## Levels of Reproducibility

It helps to distinguish four levels of reproducibility, each stricter than the last:

**Statistical reproducibility.** Results fall within expected variance. Two training runs produce models with similar accuracy. This is the standard for research.

**Classification reproducibility.** The same input produces the same classification. The confidence scores may vary, but the final decision is stable. This suffices for many production systems.

**Numerical reproducibility.** The same input produces outputs that match within a tolerance (e.g., a small relative error bound). The values are close but not necessarily identical.

**Bit-perfect reproducibility.** The same input produces byte-for-byte identical output, on every execution and on every platform. This is the level safety-critical certification can demand, and it is the focus of this article.

**Design Property: Transportable Verification**

If a system produces bit-identical outputs across platforms, verification evidence from one platform applies to all platforms. Testing on x86 provides evidence for ARM deployment without re-verification.

## Sources of Non-Reproducibility

Achieving bit-perfect reproducibility requires understanding where differences originate. Most stem from a small number of sources.

### Floating-Point Arithmetic

IEEE 754 floating-point permits variation in several areas:

**Intermediate precision.** x87 FPU uses 80-bit intermediates; ARM uses 32-bit or 64-bit. The same expression produces different rounding.

**Operation ordering.** Floating-point addition is not associative. `(a + b) + c` may differ from `a + (b + c)`. Compiler optimisations that reorder operations change results.

**Fused operations.** FMA (fused multiply-add) computes `a * b + c` with one rounding instead of two. Whether FMA is used depends on compiler flags, target architecture, and optimisation level.

**Transcendental functions.** `sin()`, `exp()`, `log()` are not specified bit-exactly. Different math libraries use different approximations.

The solution explored in [Fixed-Point Neural Networks](/insights/fixed-point-neural-networks/) is to avoid floating-point entirely for computation, using fixed-point integer arithmetic that behaves identically on all platforms.
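The ordering issue is easy to reproduce. A minimal, self-contained example (illustrative, not from the certifiable-* code):

```c
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;

    float left  = (a + b) + c;   /* 0.0f + 1.0f = 1.0f */
    float right = a + (b + c);   /* b + c typically rounds to -1e8f, giving 0.0f */

    printf("(a + b) + c = %f\n", left);
    printf("a + (b + c) = %f\n", right);
    return 0;
}
```

Whether the two expressions agree can also depend on intermediate precision and optimisation flags, which is exactly the reproducibility problem described above.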
### Hash Table Iteration Many languages implement hash tables with iteration order that depends on memory layout or randomised hash functions: ```python # Python dict iteration order can vary between runs # (though CPython 3.7+ preserves insertion order) for key in my_dict: process(key) # Order may be non-deterministic ``` In C, pointer-based hash tables often iterate in pointer order, which varies with memory allocation: ```c // Dangerous: iteration order depends on pointer values for (entry = table->first; entry; entry = entry->next) { process(entry); // Order is non-deterministic } ``` The solution is deterministic data structures that iterate in a defined order (insertion order, sorted order, or explicitly specified order). ### Threading and Parallelism Concurrent execution introduces non-determinism unless carefully controlled: **Thread scheduling.** The OS decides when threads run. Two threads racing to update shared state may interleave differently on each execution. **Parallel reduction.** Summing an array in parallel typically splits the work across threads. The order of partial sums varies, and with floating-point arithmetic, so does the result. **Lock acquisition order.** When multiple threads contend for locks, acquisition order is non-deterministic. For bit-perfect reproducibility, either avoid parallelism or use deterministic parallel patterns with explicit synchronisation that guarantees the same interleaving on every execution. ### Memory Allocation Dynamic allocation can introduce non-determinism through: **Address-dependent behaviour.** Code that accidentally depends on pointer values (e.g., using pointers as hash keys) behaves differently when allocation returns different addresses. **Allocation order.** Some allocators return memory in different orders depending on heap state, affecting programs that iterate over allocated objects by address. Static allocation, as discussed in [The Real Cost of Dynamic Memory](/insights/dynamic-memory-safety-critical/), eliminates this source of variation. ### System Calls Calls that return environmental information introduce external non-determinism: ```c time_t t = time(NULL); // Different every second int r = rand(); // Different every call (unless seeded) pid_t p = getpid(); // Different every process ``` Bit-perfect reproducibility requires either avoiding these calls or providing deterministic alternatives (e.g., injecting time as an input parameter rather than reading the system clock). ## Designing for Determinism Achieving bit-perfect reproducibility is an architectural decision, not a debugging task. It requires choosing deterministic alternatives for each potential source of variation. ### Pure Functions The foundation is pure functions: outputs depend only on inputs, with no side effects or external state access. ```c // Pure: output depends only on inputs int add(int a, int b) { return a + b; } // Impure: output depends on external state int add_with_timestamp(int a, int b) { return a + b + time(NULL); // Non-deterministic } ``` Pure functions compose deterministically. A pipeline of pure functions is itself pure. Testing pure functions is straightforward: supply inputs, check outputs. 
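Because a pure function's output depends only on its arguments, a table-driven test pins its behaviour exactly. A minimal sketch for the `add` example above:

```c
#include <assert.h>

/* Pure function under test (same as above) */
static int add(int a, int b) { return a + b; }

void test_add(void) {
    /* Each case is (input a, input b, expected output) */
    static const struct { int a, b, expected; } cases[] = {
        { 0, 0, 0 },
        { 2, 3, 5 },
        { -7, 7, 0 },
        { 1000, -250, 750 },
    };

    for (unsigned i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
        assert(add(cases[i].a, cases[i].b) == cases[i].expected);  /* exact, every time */
    }
}
```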
### Explicit State Where state is necessary, make it explicit and contained: ```c typedef struct { fixed_t weights[MAX_WEIGHTS]; fixed_t biases[MAX_BIASES]; uint32_t rng_state; // Explicit RNG state, not global } model_state_t; void infer(const model_state_t* state, const fixed_t* input, fixed_t* output) { // All state is explicit in parameters // No global variables, no system calls } ``` Explicit state enables reproducibility by controlling all inputs to the computation. ### Deterministic Algorithms Some algorithms are inherently non-deterministic; others have deterministic and non-deterministic variants. Choose deliberately: | Operation | Non-deterministic | Deterministic | |-----------|-------------------|---------------| | Sorting equal elements | Quicksort (unstable) | Mergesort (stable) | | Hash table iteration | Address order | Insertion or key order | | Parallel reduction | Thread-arrival order | Fixed tree reduction | | Random sampling | System RNG | Seeded PRNG | The deterministic variant may be slower or use more memory. For safety-critical systems, the trade-off usually favours determinism. ### Fixed-Point Arithmetic As detailed in [Fixed-Point Neural Networks](/insights/fixed-point-neural-networks/), integer arithmetic is deterministic across platforms: ```c // Non-deterministic: floating-point float result = a * b + c; // FMA? Intermediate precision? // Deterministic: fixed-point int64_t product = (int64_t)a * (int64_t)b; int32_t result = (int32_t)((product >> 16) + c); ``` The integer operations produce identical results on x86, ARM, RISC-V, and any other architecture with standard integer arithmetic. ## Verifying Determinism Design enables determinism; testing verifies it. Verification requires running the same computation on multiple platforms and comparing results bit-for-bit. ### Cross-Platform Testing The gold standard is identical binaries (where possible) or identical source compiled for each target: ```bash # Compile for each platform make ARCH=x86_64 make ARCH=aarch64 make ARCH=riscv64 # Run identical tests ./test_x86_64 > output_x86.bin ./test_aarch64 > output_arm.bin ./test_riscv64 > output_riscv.bin # Compare byte-for-byte diff output_x86.bin output_arm.bin diff output_x86.bin output_riscv.bin ``` Any difference indicates a determinism failure. The test should produce no output if all platforms match. ### Hash-Based Verification For large outputs, compare hashes rather than raw bytes: ```c void test_determinism(void) { fixed_t input[INPUT_SIZE]; fixed_t output[OUTPUT_SIZE]; // Known test input load_test_vector(input); // Run inference inference(input, output); // Compute hash uint8_t hash[32]; sha256(output, sizeof(output), hash); // Compare to expected hash (computed once, verified across platforms) const uint8_t expected[32] = { 0x3a, 0x7f, ... }; assert(memcmp(hash, expected, 32) == 0); } ``` If the hash matches the expected value, the output is bit-identical. If it differs, something has changed. ### Regression Testing Determinism tests should run on every build to catch regressions: ```yaml # CI configuration test_determinism: matrix: platform: [x86_64, aarch64, riscv64] steps: - compile: make ARCH=$platform - run: ./determinism_tests - verify: diff output.bin golden/output.bin ``` Golden outputs are generated once and stored in version control. Any change to the output indicates either a bug or an intentional change that requires updating the golden files. 
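On targets where a shell `diff` is not available, the same golden-file comparison can be done in C. A minimal sketch (file paths and buffer size are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Returns 0 if the two files are byte-identical, non-zero otherwise. */
int compare_to_golden(const char *output_path, const char *golden_path) {
    FILE *out = fopen(output_path, "rb");
    FILE *gold = fopen(golden_path, "rb");
    if (!out || !gold) {
        if (out) fclose(out);
        if (gold) fclose(gold);
        return -1;  /* a missing file counts as failure */
    }

    int result = 0;
    unsigned char a[4096], b[4096];
    size_t na, nb;
    do {
        na = fread(a, 1, sizeof(a), out);
        nb = fread(b, 1, sizeof(b), gold);
        if (na != nb || memcmp(a, b, na) != 0) {
            result = 1;  /* size or content mismatch */
            break;
        }
    } while (na == sizeof(a));

    fclose(out);
    fclose(gold);
    return result;
}
```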
### Fuzzing for Determinism Random input testing can find edge cases where determinism breaks: ```c void fuzz_determinism(int iterations) { for (int i = 0; i < iterations; i++) { // Generate random input (seeded for reproducibility) fixed_t input[INPUT_SIZE]; generate_random_input(input, seed + i); // Run twice fixed_t output1[OUTPUT_SIZE]; fixed_t output2[OUTPUT_SIZE]; inference(input, output1); inference(input, output2); // Must match exactly assert(memcmp(output1, output2, sizeof(output1)) == 0); } } ``` This catches cases where internal state leaks between invocations or where certain input patterns trigger non-deterministic code paths. ## Common Pitfalls Even systems designed for determinism can fail through subtle issues. ### Uninitialised Memory Reading uninitialised memory produces undefined values: ```c fixed_t buffer[SIZE]; // buffer contains garbage - uninitialised process(buffer); // Non-deterministic! ``` Always initialise memory explicitly: ```c fixed_t buffer[SIZE] = {0}; // Zero-initialised // or memset(buffer, 0, sizeof(buffer)); ``` ### Padding Bytes Struct padding may contain garbage: ```c typedef struct { int8_t a; // 3 bytes padding int32_t b; } padded_t; padded_t x; x.a = 1; x.b = 2; // Padding bytes are uninitialised hash(&x, sizeof(x)); // Hash includes garbage padding! ``` Zero the entire struct before use, or hash only the meaningful fields. ### Compiler Optimisations Aggressive optimisation can reorder floating-point operations: ```c // Source code float sum = a + b + c + d; // Compiler may emit float sum = (a + b) + (c + d); // Different rounding! ``` For floating-point code, use `-ffp-contract=off` and similar flags to disable operation fusion. For fixed-point code, this is not a concern since integer operations are not reordered in ways that change results. ### Library Differences Even deterministic code can produce different results if linked against different libraries: ```c // If libc differs between platforms, results may differ qsort(array, n, sizeof(int), compare); // Stable? Ordering of equal elements? ``` Either use libraries with guaranteed behaviour, or implement critical functions directly. ## Practical Considerations ### Performance Trade-offs Deterministic alternatives are sometimes slower: | Pattern | Non-deterministic | Deterministic | Overhead | |---------|-------------------|---------------|----------| | Parallel sum | Thread-order reduction | Fixed tree | ~10-20% | | Hash iteration | Pointer order | Sorted order | O(n log n) | | Memory layout | Allocator-dependent | Static/fixed | Minimal | | Floating-point | Hardware FMA | Software emulation | 2-5x | For safety-critical systems, the overhead is typically acceptable. Correctness and auditability outweigh raw performance. ### Scope of Determinism Not everything needs to be deterministic. A system can have: - Deterministic inference (same input → same output) - Non-deterministic logging (timestamps, which don't affect computation) - Non-deterministic scheduling (order of independent operations) The key is that non-determinism must not affect the auditable computation path. ### Versioning Bit-perfect reproducibility is relative to a specific version. Changing the code changes the outputs. Version management must track: - Source code version - Compiler version - Library versions - Target architecture Reproducibility claims are valid only within a specific configuration. Changing any component may change outputs. 
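One lightweight way to keep that configuration visible is to record it in the binary itself. A minimal sketch; `SOURCE_VERSION` is a hypothetical build-time define, and `__VERSION__` is a GCC/Clang predefine rather than standard C:

```c
#include <stdio.h>

/* Captured at compile time; reproducibility claims are valid only for this configuration. */
typedef struct {
    const char *source_version;   /* e.g. injected via -DSOURCE_VERSION="\"v1.2.3\"" */
    const char *compiler;         /* compiler identification string */
    const char *target;           /* target architecture */
} build_manifest_t;

static const build_manifest_t manifest = {
#ifdef SOURCE_VERSION
    SOURCE_VERSION,
#else
    "unknown",
#endif
    __VERSION__,
#if defined(__x86_64__)
    "x86_64",
#elif defined(__aarch64__)
    "aarch64",
#elif defined(__riscv)
    "riscv64",
#else
    "unknown",
#endif
};

void print_manifest(void) {
    printf("source:   %s\n", manifest.source_version);
    printf("compiler: %s\n", manifest.compiler);
    printf("target:   %s\n", manifest.target);
}
```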
## Implementation Reference The [certifiable-inference](https://github.com/williamofai/certifiable-inference) project demonstrates bit-perfect reproducibility: - Fixed-point arithmetic throughout - No floating-point in computation paths - Static allocation only - Deterministic algorithms - Cross-platform verification tests The test suite includes golden outputs verified across x86, ARM, and RISC-V. Any platform difference causes test failure. A [live simulator](https://inference.speytech.com/) demonstrates the deterministic inference pipeline interactively. ## Conclusion Bit-perfect reproducibility is achievable but requires deliberate design. The primary sources of non-determinism—floating-point arithmetic, hash table iteration, threading, memory allocation, and system calls—each have deterministic alternatives. The cost is reduced flexibility and sometimes reduced performance. The benefit is absolute certainty: the same input produces the same output, every time, on every platform. This enables cryptographic verification, exact replay, and transportable certification evidence. For safety-critical systems, where incident investigation may require reproducing exact behaviour months or years later, bit-perfect reproducibility is not merely convenient—it may be essential. As with any architectural approach, suitability depends on system requirements and verification objectives. Not all systems need bit-perfect reproducibility. But for those that do, achieving it is a matter of engineering discipline, not luck. --- *For a working implementation demonstrating bit-perfect reproducibility across platforms, see [certifiable-inference](https://github.com/williamofai/certifiable-inference) or try the [live simulator](https://inference.speytech.com/).* --- ## The Real Cost of Dynamic Memory in Safety-Critical Systems **URL**: https://speytech.com/insights/dynamic-memory-safety-critical/ **Published**: January 15, 2026 19:20 **Topic**: Safety-Critical AI Why malloc is problematic for certification and how static allocation can simplify verification Dynamic memory allocation is fundamental to modern software. Languages and frameworks assume heap access. Data structures grow and shrink as needed. Memory management happens automatically or through simple malloc/free pairs. In safety-critical systems, this flexibility becomes a liability. Certification standards like DO-178C (aerospace), IEC 62304 (medical devices), and ISO 26262 (automotive) impose requirements that make dynamic allocation difficult to verify and, in the most stringent classifications, effectively prohibited. Understanding why requires examining what dynamic allocation actually does at runtime and why those behaviours conflict with certification objectives. This article explores the technical problems with dynamic memory in safety-critical contexts, examines how certification standards address these concerns, and demonstrates static allocation patterns that can satisfy verification requirements for neural network inference. ## What Dynamic Allocation Does When code calls `malloc(size)`, the runtime memory allocator must: 1. **Search** for a free block of sufficient size 2. **Split** the block if it's larger than needed (in most implementations) 3. **Update** internal bookkeeping structures 4. **Return** a pointer to the allocated region When code calls `free(ptr)`, the allocator must: 1. **Validate** the pointer (in robust implementations) 2. **Mark** the block as available 3. 
**Coalesce** adjacent free blocks (in most implementations)
4. **Update** bookkeeping structures

These operations have properties that create challenges for safety-critical systems.

### Variable Execution Time

The time to complete malloc depends on the current state of the heap. A fresh heap with large contiguous free space satisfies requests quickly. A fragmented heap may require searching through many small blocks before finding one that fits.

This variability makes worst-case execution time (WCET) analysis difficult. Safety-critical systems often require bounded timing: the system must respond within a guaranteed deadline. If malloc can take 10μs in the best case and 10ms in the worst case, the system must be designed for the 10ms case, wasting capacity in normal operation.

### Fragmentation

Repeated allocation and deallocation of varying sizes fragments the heap. Free memory exists, but in pieces too small to satisfy requests. A system with 1MB free might fail to allocate 64KB because no single contiguous region remains.

Fragmentation is non-deterministic. It depends on the exact sequence of allocations and frees, which may depend on input data, timing, or external events. Two runs of the same program with different inputs can produce different fragmentation patterns, causing one to succeed and the other to fail.

### Failure Modes

When malloc cannot satisfy a request, it returns NULL. Correct handling of allocation failure is notoriously difficult:

```c
// Common pattern - often wrong
char* buffer = malloc(size);
if (buffer == NULL) {
    // Now what?
    // - Return error? (caller must handle)
    // - Log and continue? (with what buffer?)
    // - Abort? (acceptable in safety-critical?)
}
```

Every allocation site requires failure handling. Every failure handler must be tested. In a system with hundreds of allocation points, this creates a verification burden that grows with code complexity.

### Hidden State

The heap is global mutable state shared across the entire program. An allocation in one module affects available memory for all other modules. A memory leak in a rarely-executed path may not manifest until hours or days into operation.

This hidden coupling makes reasoning about system behaviour difficult. Two modules that appear independent may interact through heap exhaustion in ways that are hard to predict and harder to test.

## What Certification Standards Require

Certification standards address these concerns with varying degrees of strictness.

### DO-178C (Aerospace)

DO-178C does not explicitly prohibit dynamic allocation, but its objectives create significant barriers at higher Design Assurance Levels (DAL).

**Objective A-5** requires demonstration that the software "does not have unintended functions." Dynamic allocation's dependence on runtime state makes this difficult to show exhaustively.

**Objective A-7** requires verification that the software performs its intended functions. If allocation can fail, the failure paths must be verified, including demonstration that the system remains safe when memory is exhausted.

For DAL A (catastrophic failure conditions), the verification burden is so high that most projects prohibit dynamic allocation after initialisation as a practical matter. The CAST-21 position paper from certification authorities explicitly addresses dynamic memory, noting that its use requires specific means to show compliance.
### IEC 62304 (Medical Devices)

IEC 62304 Class C (highest safety classification) requires:

- **Risk analysis** of each software item
- **Verification** that software items meet requirements
- **Traceability** from requirements through design to tests

Dynamic allocation introduces risks (fragmentation, exhaustion, timing variance) that must be analysed, mitigated, and verified. For life-sustaining systems, the analysis overhead often exceeds the cost of redesigning with static allocation.

### ISO 26262 (Automotive)

ISO 26262 Part 6 Table 1 lists "No dynamic objects or variables" as a recommendation for ASIL C and ASIL D systems. While not an absolute prohibition, deviating from recommendations requires documented rationale and alternative measures.

The standard's emphasis on deterministic timing and freedom from interference aligns poorly with heap allocation's variable timing and global state.

**Design Property: Analysability**

Static allocation enables complete analysis of memory usage at compile time. Maximum memory consumption is known before the software runs, eliminating runtime exhaustion as a failure mode.

## The Static Allocation Alternative

Static allocation means all memory is reserved at compile time or during initialisation, with no runtime allocation during normal operation. This approach trades flexibility for analysability.

### Caller-Provided Buffers

Instead of functions allocating their own memory, callers provide buffers:

```c
// Dynamic allocation pattern
result_t* process_data(const input_t* input) {
    result_t* result = malloc(sizeof(result_t));
    if (result == NULL) return NULL;
    // ... process ...
    return result;  // Caller must free
}

// Static allocation pattern
int process_data(const input_t* input, result_t* result) {
    // result buffer provided by caller
    // ... process ...
    return 0;  // Success
}
```

The caller controls memory lifetime. The function cannot fail due to allocation. Memory ownership is explicit rather than implicit.

### Fixed-Size Arrays

Where dynamic arrays might grow, static allocation uses fixed maximum sizes:

```c
// Dynamic pattern
typedef struct {
    float* weights;  // malloc'd, size varies
    int size;
} layer_t;

// Static pattern
#define MAX_WEIGHTS 1024

typedef struct {
    float weights[MAX_WEIGHTS];  // Fixed at compile time
    int used;                    // Actual count in use
} layer_t;
```

This wastes memory when actual usage is below the maximum. For safety-critical systems, the trade-off is often acceptable: memory is cheap compared to verification effort.

### Object Pools

For systems that genuinely need runtime object creation, pools pre-allocate a fixed number of objects:

```c
#define POOL_SIZE 32

typedef struct {
    message_t messages[POOL_SIZE];
    uint8_t available[POOL_SIZE];  // 1 = free, 0 = in use
    int next_free;
} message_pool_t;

message_t* pool_acquire(message_pool_t* pool) {
    for (int i = 0; i < POOL_SIZE; i++) {
        int idx = (pool->next_free + i) % POOL_SIZE;
        if (pool->available[idx]) {
            pool->available[idx] = 0;
            pool->next_free = (idx + 1) % POOL_SIZE;
            return &pool->messages[idx];
        }
    }
    return NULL;  // Pool exhausted
}

void pool_release(message_pool_t* pool, message_t* msg) {
    int idx = msg - pool->messages;
    pool->available[idx] = 1;
}
```

Pool exhaustion is still possible, but the maximum is known statically. Testing can verify behaviour when all objects are in use. There is no fragmentation because all objects are the same size.

## Application to Neural Network Inference

Neural network inference is particularly amenable to static allocation because the memory requirements are known at model load time.
### Model Structure

A neural network has fixed structure:

- Layer count is fixed
- Weight dimensions per layer are fixed
- Maximum activation sizes between layers are fixed

All of this is known when the model is deployed. There is no need for runtime allocation.

### Inference Buffers

Inference requires temporary buffers for intermediate activations. Using [fixed-point arithmetic](/insights/fixed-point-neural-networks/) for determinism, the maximum size is determined by the largest layer:

```c
// Determined at model compile time
#define MAX_ACTIVATION_SIZE 4096

typedef struct {
    fixed_t weights[MAX_WEIGHTS];
    fixed_t biases[MAX_BIASES];
    fixed_t activation_a[MAX_ACTIVATION_SIZE];
    fixed_t activation_b[MAX_ACTIVATION_SIZE];
} inference_ctx_t;

void infer(inference_ctx_t* ctx, const fixed_t* input, fixed_t* output) {
    // Ping-pong between activation buffers
    const fixed_t* current = input;
    fixed_t* next = ctx->activation_a;

    for (int layer = 0; layer < NUM_LAYERS; layer++) {
        layer_forward(ctx, layer, current, next);  /* per-layer forward pass (helper name illustrative) */
        current = next;
        next = (next == ctx->activation_a) ? ctx->activation_b : ctx->activation_a;
    }

    memcpy(output, current, OUTPUT_SIZE * sizeof(fixed_t));
}
```

Two activation buffers suffice for any depth network by alternating between them. The maximum size is the largest layer, not the sum of all layers.

### Convolution Buffers

2D convolution can be implemented with static buffers:

```c
void fx_conv2d(const fixed_t* input, int in_h, int in_w,
               const fixed_t* kernel, int k_h, int k_w,
               fixed_t* output) {
    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;

    // No allocation - all pointers provided by caller
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int64_t acc = 0;
            for (int kh = 0; kh < k_h; kh++) {
                for (int kw = 0; kw < k_w; kw++) {
                    acc += (int64_t)input[(oh + kh) * in_w + (ow + kw)]
                         * (int64_t)kernel[kh * k_w + kw];
                }
            }
            output[oh * out_w + ow] = (fixed_t)(acc >> 16);
        }
    }
}
```

The function uses only stack variables (loop counters, accumulator) and caller-provided buffers. Stack usage is O(1), predictable and analysable.

## Verification Benefits

Static allocation simplifies verification in several ways.

### Memory Usage Analysis

With static allocation, maximum memory usage is computable at compile time:

```
Total RAM = sizeof(inference_ctx_t)
          + sizeof(input_buffer)
          + sizeof(output_buffer)
          + stack_high_water_mark
```

This total can be compared against available RAM with certainty. There is no "usually enough" or "depends on input." Either the memory fits or it doesn't.

### Stack Analysis

Static allocation tools can compute precise stack usage for each function and call path. Combined with static buffers, the total memory footprint is fully characterised without running the code.

### No Exhaustion Testing

If memory cannot be exhausted at runtime, exhaustion handling does not require testing. This eliminates a class of difficult-to-reach test cases and associated verification evidence.

### Timing Predictability

Without malloc's variable search time, timing analysis becomes tractable. Loop bounds are known. Memory access patterns are predictable. WCET can be computed or measured with confidence.

## Trade-offs

Static allocation is not without costs.

**Static Allocation Benefits**

- Compile-time memory analysis
- No fragmentation
- No exhaustion during operation
- Predictable timing
- Simplified verification

**Static Allocation Costs**

- Memory sized for worst case
- Maximum sizes must be known in advance
- Less flexible data structures
- May require architectural changes
- Not suitable for all applications

For applications where input sizes vary enormously (text processing, general-purpose computing), static allocation may be impractical. For embedded inference with known model dimensions, the constraints are typically acceptable.
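That compile-time analysability can be enforced mechanically. A minimal sketch using C11 `static_assert`, assuming the `inference_ctx_t` and buffer-size constants from the surrounding snippets and a hypothetical `RAM_BUDGET_BYTES` for the target:

```c
#include <assert.h>   /* C11 static_assert */

#define RAM_BUDGET_BYTES  (64u * 1024u)   /* hypothetical RAM budget for the target */
#define STACK_RESERVE     (8u * 1024u)    /* measured stack high-water mark */

/* Fails the build, not the deployed system, if the static footprint outgrows the budget. */
static_assert(sizeof(inference_ctx_t)
              + INPUT_SIZE * sizeof(fixed_t)
              + OUTPUT_SIZE * sizeof(fixed_t)
              + STACK_RESERVE
              <= RAM_BUDGET_BYTES,
              "static memory footprint exceeds the RAM budget");
```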
## Implementation Patterns ### Initialisation-Time Allocation Some systems permit allocation during initialisation but not during operation: ```c typedef struct { fixed_t* weights; fixed_t* activations; int initialised; } runtime_ctx_t; int runtime_init(runtime_ctx_t* ctx, const model_config_t* cfg) { // Allocation permitted here ctx->weights = malloc(cfg->weight_bytes); ctx->activations = malloc(cfg->activation_bytes); if (!ctx->weights || !ctx->activations) { free(ctx->weights); free(ctx->activations); return -1; // Initialisation failure is recoverable } ctx->initialised = 1; return 0; } int runtime_infer(runtime_ctx_t* ctx, const fixed_t* in, fixed_t* out) { // No allocation here - operation phase assert(ctx->initialised); // ... inference using pre-allocated buffers ... return 0; } ``` This pattern separates startup (where failure is recoverable) from operation (where allocation failure could be catastrophic). Certification focuses on demonstrating that no allocation occurs after initialisation. ### Compile-Time Configuration For maximum analysability, buffer sizes can be compile-time constants: ```c // config.h - generated from model analysis #define INPUT_SIZE 784 #define HIDDEN_SIZE 256 #define OUTPUT_SIZE 10 #define MAX_ACTIVATION MAX(INPUT_SIZE, MAX(HIDDEN_SIZE, OUTPUT_SIZE)) // inference.c static fixed_t g_activation_a[MAX_ACTIVATION]; static fixed_t g_activation_b[MAX_ACTIVATION]; ``` The sizes are visible in the source. Static analysis tools can compute total memory without execution. Changes to the model require recompilation, but the memory contract is explicit. ## Practical Considerations ### Memory Overhead Static allocation typically uses more memory than dynamic allocation would for the same workload. The overhead is the difference between worst-case and average-case usage. For neural network inference, the overhead is often modest. Layer sizes don't vary at runtime; they're fixed by the model architecture. The "worst case" is every case. ### Code Portability Static allocation patterns require larger buffers to be passed through call chains. This can make APIs more verbose: ```c // Dynamic style - cleaner API image_t* resize(const image_t* src, int new_w, int new_h); // Static style - explicit buffers int resize(const image_t* src, image_t* dst, int new_w, int new_h, uint8_t* scratch, size_t scratch_size); ``` The verbosity is the cost of explicitness. For safety-critical code, explicit memory management is typically preferred despite the syntactic overhead. ### Legacy Integration Existing codebases may assume dynamic allocation. Migration to static allocation can require significant refactoring. For new safety-critical projects, designing for static allocation from the start is substantially easier than retrofitting. ## Implementation Reference The [certifiable-inference](https://github.com/williamofai/certifiable-inference) project demonstrates static allocation throughout: - All buffers provided by callers - No malloc after initialisation - Fixed-size structures with compile-time dimensions - O(1) stack usage in all operations The implementation is suitable for environments where dynamic allocation is prohibited and provides a reference for teams designing their own safety-critical inference engines. ## Conclusion Dynamic memory allocation introduces variability that conflicts with safety-critical certification objectives. Timing varies with heap state. Fragmentation can cause late-life failures. Exhaustion handling creates verification burden. 
Static allocation eliminates these concerns by moving memory decisions to compile time. The cost is reduced flexibility and potential memory overhead. For applications with known memory requirements, like neural network inference with fixed model dimensions, the trade-off typically favours static allocation. Certification standards increasingly recognise this trade-off. While few explicitly prohibit dynamic allocation, the verification burden it creates pushes safety-critical projects toward static patterns. Understanding why helps engineers make informed architectural decisions early in development, when the cost of change is lowest. As with any architectural approach, suitability depends on system requirements, memory constraints, and regulatory context. Static allocation is not universally appropriate, but for safety-critical AI where certification is required, it offers a verification-friendly foundation. --- *For a working implementation of static allocation patterns for neural network inference, see [certifiable-inference](https://github.com/williamofai/certifiable-inference) or try the [live simulator](https://inference.speytech.com/).* --- ## From Proofs to Code: Mathematical Transcription in C **URL**: https://speytech.com/insights/mathematical-proofs-to-code/ **Published**: January 12, 2026 23:25 **Topic**: Formal Methods How mathematical contracts become deterministic implementations ## The Gap Between Mathematics and Implementation Software engineers learn early that implementation differs from specification. Requirements written in natural language get interpreted, approximated, and sometimes misunderstood during coding. Even when requirements are precise, the translation to code introduces opportunities for divergence. In safety-critical systems, this gap represents risk. When software controls flight surfaces, medical devices, or autonomous vehicles, the question "does the code match the specification?" isn't academic—it's a certification requirement. Standards like DO-178C, ISO 26262, and IEC 62304 all demand evidence that implementations correctly realize their requirements. One approach to closing this gap is mathematical transcription: the practice of expressing requirements as formal proofs, then translating those proofs mechanically into code. When done properly, the code becomes a direct representation of proven mathematical properties rather than an interpretation of natural language requirements. ## What Mathematical Transcription Is Not Before describing the approach, it helps to clarify what we're not discussing. This is not automated code generation from formal specifications, though such tools exist. Systems like Coq's extraction facility or SPARK's proof-carrying code can generate implementations from high-level specifications. These are powerful approaches, particularly for purely functional code or systems where the target language matches the specification language well. This is also not traditional "design by contract" in the sense Eiffel popularized, where preconditions and postconditions are runtime checks added to existing code. While contracts are central to our approach, we're describing something more fundamental: using mathematical proofs to determine what the code should be, not just how to verify it afterward. What we're describing is a methodology where mathematical analysis precedes and directs implementation. The proof comes first, the struct design follows from the proof, and the code follows from the struct design. 
Each step is a mechanical translation of the previous one.

## Stage 1: Mathematical Analysis

The process begins with a question: what are we trying to accomplish, stated mathematically?

Consider a simple monitoring function that should trigger an alert when a value exceeds a threshold. In natural language: "Monitor a value and alert when it crosses the threshold."

Simple enough, but this leaves critical questions unanswered: What is the domain of valid values? What happens at boundary conditions? What if the value equals the threshold exactly? What are the timing requirements? Can the value change during evaluation?

Mathematical analysis makes these implicit assumptions explicit:

```
Let v ∈ ℕ be the monitored value, where 0 ≤ v < 2^32
Let t ∈ ℕ be the threshold, where 0 ≤ t < 2^32
Alert condition: v ≥ t
```

The struct design captures exactly the quantities named in the analysis, and the monitoring function transcribes the proven logic:

```c
typedef struct {
    uint32_t value;      /* v: monitored value, bounded by the uint32_t range */
    uint32_t threshold;  /* t: threshold, bounded by the uint32_t range */
    uint8_t  triggered;  /* alert state */
} monitor_t;

void monitor_check(monitor_t* m) {
    /* Preconditions: inputs satisfy the proof's assumptions */
    assert(m != NULL);
    assert(m->triggered == 0 || m->triggered == 1);

    if (m->value >= m->threshold) {
        m->triggered = 1;
    }

    /* Postcondition: alert state correctly reflects comparison */
    assert((m->value >= m->threshold) == (m->triggered == 1));
}
```

The assertions are not defensive programming—they're the mechanized form of our mathematical requirements. The preconditions ensure inputs satisfy our proof's assumptions. The postcondition verifies the implementation matches the proven logic.

The core logic (`if (m->value >= m->threshold)`) is the direct C translation of our mathematical alert condition (v ≥ t). There's no interpretation, no approximation—just transcription.

## Why This Works for Safety-Critical Systems

This approach addresses specific challenges in safety-critical software development.

### Verification Tractability

When code is a direct transcription of proofs, verification becomes checking transcription accuracy rather than reasoning about complex implementation logic. Reviewers can compare the code against the proof term by term, confirming that each mathematical requirement has a corresponding code representation.

This makes manual review more effective and automated verification more tractable. Tools like Frama-C, SPARK, or TLA+ can verify that implementations satisfy their specifications because the gap between specification and implementation is minimal.

### Certification Evidence

Safety standards require evidence that implementations correctly realize requirements. When the implementation process is "prove, then transcribe," the mathematical proof itself serves as certification evidence.

DO-178C Level A, for instance, requires that safety requirements be traceable to design and implementation. When struct fields trace directly to proof elements, and code traces directly to proof logic, this traceability is inherent rather than documented after the fact.

### Maintenance Clarity

Software changes over time. When modifications are needed, the mathematical proof provides the definitive specification of what the code is supposed to do. Engineers making changes can first update the proof to reflect new requirements, then update the struct and code to match.

This inverts the typical maintenance process where engineers read existing code, infer its intent, make modifications, and hope they haven't broken invariants. With proof-first development, invariants are explicit and breaking them requires conscious proof modifications.

## Properties Preserved Through Transcription

The transcription process is designed to preserve specific properties from proof to implementation:

**Design Property: Bounds Preservation.** Type choices in the struct design enforce the bounds defined in the mathematical proof, such that values cannot exist outside their proven domains.
By choosing `uint32_t` with explicit bounds checking, we guarantee that values cannot violate the mathematical assumptions. The type system participates in maintaining proof properties. ### Totality Mathematical proofs operate on total functions—functions defined for all inputs in their domain. Partial functions, which are undefined for some inputs, introduce complexity and risk. The transcription process preserves totality by ensuring all code paths are defined. The `if` statement in our example handles both cases: value below threshold (no action) and value at or above threshold (set triggered flag). There are no missing cases, no undefined behavior. ### Determinism Proofs assume deterministic evaluation: given identical inputs, the function produces identical outputs. The transcription preserves this by avoiding sources of nondeterminism. No dependency on system state beyond the monitored struct. No reliance on timing, no global variables, no hidden state. The function is pure: its behavior depends only on its inputs and affects only its outputs. ## Practical Example: State Machine Transitions The approach scales beyond simple monitors to more complex systems. Consider a state machine with proven transition conditions: ``` States: S = {IDLE, ACTIVE, ERROR} Transitions proven valid: IDLE → ACTIVE when condition C1 holds ACTIVE → IDLE when condition C2 holds ACTIVE → ERROR when condition C3 holds ERROR → IDLE when reset() called ``` The struct represents this exactly: ```c typedef enum { STATE_IDLE, STATE_ACTIVE, STATE_ERROR } state_t; typedef struct { state_t current_state; uint32_t tick_count; /* For C1, C2, C3 evaluation */ uint8_t error_detected; /* For C3 */ } fsm_t; ``` And transitions transcribe the proven logic: ```c void fsm_transition(fsm_t* fsm) { assert(fsm != NULL); switch (fsm->current_state) { case STATE_IDLE: if (condition_c1(fsm)) { fsm->current_state = STATE_ACTIVE; } break; case STATE_ACTIVE: if (condition_c3(fsm)) { fsm->current_state = STATE_ERROR; } else if (condition_c2(fsm)) { fsm->current_state = STATE_IDLE; } break; case STATE_ERROR: /* Only reset() can exit ERROR state */ break; } } ``` The code structure mirrors the mathematical transition rules. Each proven transition has exactly one corresponding code path. No transitions occur that weren't proven valid. ## Testing as Proof Validation When code is transcribed from proofs, testing serves a different purpose than in conventional development. Traditional testing explores implementation behavior to find defects. Proof-based testing validates that transcription was performed correctly. Test cases come directly from proof properties: ```c /* Test: boundary condition from proof */ void test_threshold_boundary(void) { monitor_t m = { .value = 100, .threshold = 100, .triggered = 0 }; monitor_check(&m); assert(m.triggered == 1); /* v ≥ t should trigger */ } /* Test: totality - below threshold case */ void test_below_threshold(void) { monitor_t m = { .value = 99, .threshold = 100, .triggered = 0 }; monitor_check(&m); assert(m.triggered == 0); /* v < t should not trigger */ } ``` These aren't arbitrary test cases chosen to exercise code paths. They're mechanized checks of the mathematical properties we proved. If a test fails, either the transcription was incorrect or the proof was incomplete. ## Where Formal Verification Tools Enter Mathematical transcription can proceed without automated tools—it's fundamentally a human activity of careful translation. 
However, formal verification tools can validate that transcription was performed correctly. Tools like Frama-C with its WP plugin allow expressing the mathematical proof as ACSL annotations, then proving the C implementation satisfies them. The process becomes: 1. Write mathematical proof 2. Design struct from proof 3. Transcribe to C implementation 4. Express proof as ACSL annotations 5. Use WP to verify implementation matches annotations The tool doesn't replace the human insight of creating the proof or designing the struct. It validates that the mechanical transcription step was performed correctly. Similarly, SPARK for Ada provides a framework where the language itself enforces many proof properties through its type system and runtime checks. The transcription process adapts to leverage language capabilities—Ada's strong typing preserves more proof properties automatically than C requires us to enforce manually. ## Real-World Application Patterns Organizations using proof-based transcription typically follow certain patterns: ### Safety Kernels In safety-critical systems, a small "safety kernel" is often proven formally while surrounding functionality uses conventional development. The kernel handles critical functions like mode management, watchdog timing, or health monitoring. These components are small enough that complete mathematical treatment is tractable. The [c-from-scratch educational course](https://github.com/williamofai/c-from-scratch) demonstrates this pattern by building a heartbeat monitor using the prove-first methodology. The resulting code is approximately 200 lines, small enough for complete formal treatment. ### Incremental Adoption Rather than requiring all code to follow proof-based transcription, teams often start with the most critical components. A voting algorithm, a critical state machine, or a safety interlock gets the full mathematical treatment. Less critical code uses conventional practices. This allows teams to gain experience with the methodology while limiting the investment. As the approach proves valuable, it expands to additional components. ### Certification-Driven Development For systems requiring DO-178C Level A or similar certification, proof-based transcription provides clear traceability from requirements through verification. Each certification deliverable maps to a specific artifact: requirements become mathematical proofs, design becomes struct definitions, implementation becomes transcribed code, and verification becomes proof validation. ## Limitations and Practical Considerations Mathematical transcription has clear benefits for safety-critical systems, but also limitations. The approach works best for systems where behavior can be characterized mathematically. Control logic, state machines, monitoring functions, and algorithms map well to mathematical specification. Interactions with hardware, timing-dependent behavior, or systems with emergent properties are more challenging. The methodology requires engineers comfortable with mathematical reasoning. Not every programming problem needs proof-first development, and not every engineer finds the approach natural. Organizations adopting this approach must invest in training and accept that some problems resist mathematical characterization. Finally, proof-based transcription doesn't eliminate all defects. The proof might be incorrect, the transcription might introduce errors, or the requirements themselves might be wrong. 
What it does provide is a systematic approach to reducing the gap between specification and implementation. ## Integration with Deterministic Platforms Proof-based transcription naturally aligns with deterministic execution platforms. When proofs assume deterministic behavior, implementing on platforms that guarantee determinism preserves proof properties through execution. The [Murray Deterministic Computing Platform (MDCP)](/mdcp/) provides execution semantics that match the assumptions common in mathematical proofs: bounded resources, deterministic scheduling, and reproducible behavior. Code transcribed from proofs runs on MDCP with the same properties the proofs assume. This integration matters for certification. When an implementation is proven correct under certain execution assumptions, those assumptions must hold in the deployed system. Deterministic platforms make those assumptions explicit and verifiable. ## Conclusion Mathematical transcription offers a systematic approach to closing the gap between specification and implementation in safety-critical systems. By proving properties first, designing structs to represent those properties, and transcribing the proven logic to code, we create implementations that are traceable to their requirements by construction. The approach is not suitable for all software. It requires mathematical characterization of requirements, engineers comfortable with formal reasoning, and acceptance that upfront proof effort exceeds initial coding time. For systems where certification matters, where errors have severe consequences, or where long-term maintainability justifies upfront investment, the benefits can justify the cost. The key insight is treating implementation as a mechanical process rather than a creative one. Once the mathematics is worked out, the code practically writes itself. What remains is careful transcription and systematic validation that the transcription preserves the proven properties. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For organizations developing safety-critical software under standards like DO-178C or ISO 26262, proof-based transcription provides a path to the evidence and traceability those standards require. --- ## The Hidden Cost of Non-Determinism **URL**: https://speytech.com/insights/nondeterminism-cost-debugging/ **Published**: January 12, 2026 19:50 **Topic**: Business Case Understanding the financial impact of debugging race conditions and Heisenbugs Note: The cost estimates and financial figures discussed in this article are indicative based on industry patterns and published research. Actual costs vary significantly depending on organization size, domain, system complexity, and development practices. These figures should not be used as the basis for specific investment decisions without conducting organization-specific analysis. ## The Iceberg Problem When a CFO reviews a software project's budget, they typically see line items for development, testing, and deployment. What they don't see is the disproportionate cost hidden beneath the surface: the engineering time consumed by nondeterministic defects that resist conventional debugging approaches. Nondeterministic bugs, often called Heisenbugs, are defects whose behavior changes or disappears when you attempt to study them. They arise from race conditions, timing dependencies, uninitialized variables, and other sources of execution variability. 
Unlike deterministic bugs that can be reliably reproduced and fixed, these elusive defects can consume weeks or months of senior engineering time. The financial impact is rarely captured in project accounting. Organizations track "development costs" and "testing costs" but seldom isolate the specific overhead attributable to nondeterministic behavior. Yet this hidden cost can be substantial enough to affect project ROI, market timing, and competitive position. ## Understanding the Cost Structure The economics of nondeterministic debugging differs fundamentally from conventional defect resolution. A typical deterministic bug might follow this pattern: discover the issue, reproduce it reliably, identify the root cause, implement a fix, verify the correction. Total time: hours to days. A nondeterministic defect follows a very different trajectory. ### Investigation Time Multiplication When engineers encounter a Heisenbug, the first challenge is simply determining that they're dealing with nondeterministic behavior rather than environmental factors or user error. This triage phase can consume significant time as teams attempt various reproduction strategies. Once identified as nondeterministic, the debugging process becomes probabilistic. An engineer might spend days attempting to reproduce a race condition that manifests only under specific timing circumstances. Adding logging to understand the issue may alter the timing enough that the bug disappears, a phenomenon that can lead to false confidence that the problem has been resolved. Research on debugging effectiveness suggests that nondeterministic defects can require three to ten times more investigation effort than comparable deterministic issues. This multiplier reflects not just direct debugging time but also the cognitive overhead of reasoning about multiple possible execution paths and the frustration of working with unreliable reproduction. ### Context Switching Overhead Heisenbugs rarely follow a linear debugging path. An engineer begins investigating, hits a reproduction failure, switches to other work, then returns when the bug resurfaces. Each context switch carries cognitive overhead as the engineer re-establishes mental models of the code, the suspected failure mode, and the debugging strategy. In organizations with multiple ongoing projects, nondeterministic issues can pull senior engineers away from planned work repeatedly as bugs resurface in production or testing environments. The opportunity cost of this disruption compounds the direct debugging time cost. ### Verification Challenges Once a fix is implemented for a nondeterministic defect, verification becomes problematic. How do you confirm that a race condition has been eliminated? Running the test suite once and seeing success proves little when the original issue manifested intermittently. Teams often resort to stress testing, running the same scenarios hundreds or thousands of times to gain statistical confidence. This verification burden adds time to the fix-deploy cycle and requires additional infrastructure for reliable stress testing environments. ## Domain-Specific Cost Amplification The financial impact of nondeterministic debugging varies by domain, with safety-critical systems experiencing particularly severe cost amplification. ### Aerospace and Automotive Systems In aerospace or automotive contexts governed by standards like DO-178C or ISO 26262, nondeterministic behavior creates certification risk. 
A bug that cannot be reliably reproduced is difficult to analyze for safety implications. The investigation must not only identify the root cause but also demonstrate that all related scenarios have been addressed. Certification timelines in these domains can extend by months when nondeterministic issues arise late in development. Given that certification delays typically cost organizations in the range of hundreds of thousands to millions per month in delayed revenue and continued development expenses, even a single stubborn Heisenbug can have substantial financial implications. ### Medical Device Development IEC 62304 Class C medical device software faces similar challenges. Post-market incidents involving nondeterministic behavior can trigger regulatory actions, field corrections, and potential litigation exposure. The cost of a single post-deployment Heisenbug in a medical device context can exceed the entire development budget when you factor in investigation, regulatory response, field updates, and potential liability. ### Financial Trading Systems In high-frequency trading or transaction processing systems, nondeterministic behavior can have immediate P&L impact. A race condition that causes order mispricing or double execution represents both direct financial loss and regulatory exposure. The cost of such issues is measured not in engineering hours but in trading losses and potential regulatory penalties. ## The Certification Timeline Impact One of the least visible but most significant cost impacts of nondeterminism appears in the certification and regulatory approval process for safety-critical systems. Certification bodies require evidence that systems behave correctly under all operational conditions. When nondeterministic behavior is present, collecting this evidence becomes challenging. Test results that vary between runs undermine confidence and can lead to requests for additional testing, architectural changes, or enhanced monitoring. An organization pursuing DO-178C Level A certification might plan for a 24-month certification timeline. If nondeterministic issues emerge during verification and validation, that timeline can extend to 30 or 36 months. The financial impact of this extension includes continued development expenses, delayed market entry, and competitive disadvantage. For a mid-sized aerospace supplier with a DO-178C Level A component, each quarter of certification delay might represent opportunity cost in the range of several million in delayed revenue, based on typical product pricing and market dynamics. This far exceeds the direct cost of the engineering time spent debugging the nondeterministic issues themselves. ## Hidden Costs in Team Productivity Beyond direct debugging time, nondeterministic issues affect team productivity in subtle ways that rarely appear in project accounting. ### Morale and Retention Senior engineers working on Heisenbugs describe the experience as uniquely frustrating. The lack of reliable reproduction makes progress difficult to measure, and the probabilistic nature of verification means that apparent solutions may prove illusory. This frustration affects morale and can contribute to retention challenges in teams working on systems with frequent nondeterministic issues. Recruiting costs for senior embedded engineers or safety-critical developers can range from tens of thousands to over $100,000 per position when you factor in recruiting fees, relocation, and productivity ramp-up time. 
If nondeterministic debugging contributes to turnover of even one senior engineer per year, the hidden cost becomes significant. ### Technical Debt Accumulation Teams facing pressing nondeterministic issues often take shortcuts. Rather than resolving the root cause, they might add defensive code, implement workarounds, or adjust timing assumptions to reduce symptom frequency. These quick fixes add technical debt that compounds over time. Later refactoring to remove these workarounds and address underlying architectural issues can consume substantial effort. Organizations may carry this technical debt for years, accepting degraded system performance or maintainability because the cost of properly addressing the underlying nondeterminism seems prohibitive. ## Measurement and Visibility Challenges One reason nondeterminism costs remain hidden is the difficulty of measurement. Traditional project accounting captures development phases, feature implementation, and testing cycles. It doesn't distinguish between time spent implementing planned functionality and time spent investigating Heisenbugs. Some organizations attempting to quantify this cost use time-tracking categories that separate "defect investigation" from "development," but this still doesn't capture the full picture. The context switching overhead, opportunity cost of delays, and technical debt accumulation rarely show up in direct cost accounting. ### Establishing Baseline Metrics Organizations serious about understanding nondeterminism costs can establish metrics tracking reproduction time, investigation duration, and verification cycles for different defect categories. Over time, these metrics can reveal patterns showing which system components or architectural approaches correlate with high nondeterminism costs. A simple starting point: track "time to reliable reproduction" for each defect. Issues requiring more than a few hours to reproduce reliably should be flagged for analysis. Patterns in these flags can indicate architectural areas requiring attention. ## Deterministic Architecture as a Cost Mitigation Strategy Deterministic computing platforms address nondeterminism costs by eliminating variability in execution. When a system produces identical results for identical inputs, several cost categories decrease: Investigation time reduces because bugs can be reliably reproduced. An engineer encountering an issue can capture the input conditions and replay the exact execution path to understand what happened. Verification costs decrease because a test that passes once validates the fix comprehensively rather than providing statistical confidence. Teams can verify corrections quickly rather than running stress tests. Certification timelines can be compressed because evidence collection becomes more straightforward. Test logs represent reproducible facts about system behavior rather than samples from a probability distribution. The [debugging economics article](/insights/debugging-economics/) explores the technical mechanisms through which deterministic execution enables faster defect resolution. The financial benefit stems from reducing the investigation time multiplier from potentially 5-10× down to roughly 1-2× relative to comparable deterministic issues. ## ROI Considerations for Architectural Change Adopting deterministic architecture requires upfront investment. Development practices must change, teams need training, and some systems may require refactoring to eliminate sources of nondeterminism. 
CFOs evaluating such investments should consider both direct costs and opportunity costs. The direct costs are relatively bounded: training expenses, potential tooling changes, and engineering time for architectural modifications. These might range from tens of thousands to hundreds of thousands depending on organization size and system complexity. The opportunity costs can be more significant. If deterministic architecture enables faster certification, earlier market entry, or reduced field support costs, the financial benefit may substantially exceed the direct implementation cost. However, these benefits are inherently uncertain and depend on factors like regulatory environment, competitive dynamics, and product lifecycle. ### Decision Framework Organizations can approach this decision systematically by estimating current nondeterminism costs, projecting potential reduction from deterministic architecture, and comparing against implementation costs: First, analyze historical data on defect resolution time, certification timeline extensions, and field support incidents attributable to nondeterministic behavior. This establishes a baseline cost. Second, project the potential reduction in these costs if investigation time multipliers decrease and certification evidence collection becomes more tractable. Make conservative assumptions about improvement magnitude. Third, estimate the implementation cost including training, architectural changes, and any productivity impact during transition. The comparison provides a basis for evaluating whether the investment makes financial sense in the specific organizational context. ## Risk-Adjusted Perspective The financial analysis of nondeterminism costs should also consider risk scenarios. While most Heisenbugs represent costly debugging time, some can trigger catastrophic outcomes in safety-critical contexts. High-profile incidents in aerospace and automotive domains, while typically involving multiple contributing factors, have sometimes included components where timing-dependent behavior complicated investigation or contributed to failure modes. The financial impact of such incidents extends beyond immediate costs to include regulatory scrutiny, market reputation damage, and potential litigation exposure. From a CFO perspective, investing in deterministic architecture can be viewed partly as risk mitigation. Even if the expected value calculation based on average debugging costs doesn't strongly favor the investment, the tail risk reduction may justify it in domains where system failures carry severe consequences. ## Practical Implementation Path For organizations considering deterministic architecture to address nondeterminism costs, a phased approach can manage both financial risk and implementation complexity. Starting with new safety-critical components rather than retrofitting entire systems limits initial investment while still capturing benefits where they matter most. As teams gain experience and measurement validates the cost reduction, the approach can expand to additional system areas. This incremental strategy allows financial performance to be tracked and investment decisions to be adjusted based on actual results rather than projections. Early wins in certification timeline compression or debugging efficiency can justify broader adoption. 
The [Murray Deterministic Computing Platform (MDCP)](/mdcp/) demonstrates this approach by providing deterministic kernels that can host critical functions while integrating with existing system architectures. Similarly, [CardioCore](/cardiocore/) shows application in medical device contexts where both certification timeline and post-market risk justify the architectural investment. ## Conclusion The hidden cost of nondeterminism represents a significant but often unmeasured drain on software development economics, particularly in safety-critical domains. These costs appear in prolonged debugging cycles, extended certification timelines, accumulated technical debt, and opportunity costs from delayed market entry. For CFOs and engineering leadership evaluating architectural investments, understanding these hidden costs provides important context. While deterministic architecture requires upfront investment, the potential for reducing investigation time multipliers, compressing certification timelines, and mitigating field support costs can represent substantial financial benefit. The key is measurement. Organizations that begin tracking reproduction time, investigation duration, and certification delays attributable to nondeterministic behavior can make data-driven decisions about architectural approaches. Those that fail to measure these costs may continue spending far more on Heisenbug investigation than they realize, while missing opportunities for cost reduction through architectural choices. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. The financial case for deterministic architecture strengthens in domains where certification timelines matter, where post-deployment defects carry high costs, and where engineering time is a limiting factor in development velocity. --- ## ISO 26262 and ASIL-D: The Role of Determinism **URL**: https://speytech.com/insights/iso-26262-asil-d-determinism/ **Published**: January 12, 2026 19:20 **Topic**: Certification How deterministic architecture can support automotive functional safety certification Note: This article discusses architectural approaches that can support ISO 26262 compliance. It does not provide legal, regulatory, or certification advice. Actual certification outcomes depend on system context, implementation details, assessor interpretation, and comprehensive safety analysis. Organizations pursuing certification should work directly with qualified functional safety consultants and certification bodies. ## The ASIL-D Challenge ISO 26262 defines four Automotive Safety Integrity Levels (ASIL A through D), with ASIL D representing the highest classification of hazard and requiring the most stringent safety measures. Systems like airbags, anti-lock brakes, and power steering typically require ASIL D classification because the risks associated with their failure are potentially fatal. ASIL classification is determined through Hazard Analysis and Risk Assessment (HARA), evaluating three factors: Severity of potential injury, Exposure to operational conditions where the hazard could occur, and Controllability by the driver to prevent harm. A combination of the most severe ratings across these factors results in ASIL D classification. The challenge facing automotive safety engineers is straightforward: ASIL D requires the highest level of assurance that dependent safety goals are sufficient and have been achieved. 
This assurance must be demonstrated through rigorous verification, validation, and evidence generation throughout the development lifecycle.

## Where Nondeterminism Creates Certification Risk

Traditional automotive E/E architectures face a fundamental challenge during certification: the behavior of nondeterministic systems can vary between executions, even with identical inputs. This variability introduces specific risks during the ISO 26262 certification process.

### Verification Complexity

The standard specifies safety requirements that are more rigorous at higher ASIL levels in order to achieve the required lower likelihood of dangerous failures. When system behavior includes nondeterministic elements such as thread scheduling variations, interrupt timing dependencies, or floating-point optimization differences, verification becomes challenging.

Consider a critical safety function that must respond within defined timing bounds. In a nondeterministic system, the actual response time may vary based on scheduler decisions, cache states, or interrupt patterns. Demonstrating that the system meets its timing requirements under all possible execution paths becomes an evidence problem: how do you prove consistent behavior when the system itself doesn't behave consistently?

### Evidence Generation Challenges

ISO 26262 determines requirements for validation, verification, and confirmation measures to ensure that a sufficient and acceptable level of safety is being achieved. These measures depend fundamentally on the ability to collect meaningful evidence about system behavior.

In nondeterministic systems, test results may vary between runs. A test that passes in one execution might fail in another, not because the system changed, but because the underlying execution path differed. This phenomenon, known as flaky testing, undermines confidence in the verification process and makes it difficult to demonstrate that safety requirements have been met consistently.

### Post-Incident Analysis Requirements

ISO 26262 uses a risk-based approach where hazards are assessed and safety measures are defined to avoid or control systematic failures and to detect or control random hardware failures or mitigate their effects. When an incident occurs during testing or field operation, understanding what happened requires the ability to reconstruct the exact sequence of events.

Nondeterministic systems make reconstruction challenging. If you cannot reliably reproduce the conditions that led to a failure, you cannot definitively determine whether the failure resulted from a systematic error in the design or from an environmental factor. This ambiguity creates risk during certification assessments.

## How Deterministic Architecture Can Support Certification

Deterministic computing platforms are designed to produce bit-identical results for identical inputs, eliminating execution path variability. This architectural property can support several aspects of the ISO 26262 certification process.

**Design Property: Execution Reproducibility.** Under defined conditions, a deterministic system executing the same input sequence produces identical outputs and identical internal state transitions, enabling precise verification of safety requirements.

### Verification Benefits

With deterministic execution, verification testing becomes more tractable. A test that validates a safety function's behavior represents not just one execution path, but all execution paths for that input sequence.
This property can reduce the verification burden by eliminating the need to account for scheduling variations, cache state differences, and other sources of nondeterministic behavior. For timing analysis, deterministic systems provide clearer bounds. Worst-case execution time analysis becomes more precise when the scheduler behavior is deterministic and when there are no hidden variables affecting execution timing. ### Evidence Quality Deterministic platforms can generate higher-quality evidence for certification assessments. Test logs from deterministic systems represent reproducible facts about system behavior rather than samples from a distribution of possible behaviors. When an assessor reviews verification evidence, deterministic execution provides stronger assurance that the tested behavior matches the deployed behavior. ISO 26262 highly recommends the use of semi-formal modeling languages for ASIL D designs, with executable validation using either prototyping or simulation being mandatory. Deterministic execution supports this requirement by ensuring that simulation results directly correspond to actual system behavior. ### Incident Investigation When issues arise during development or field operation, deterministic systems enable precise reconstruction of events. Given the same initial state and input sequence, the system will traverse the same execution path, allowing engineers to understand exactly what happened and why. This capability supports root cause analysis and helps distinguish between systematic design errors and environmental factors. ## Architectural Considerations for Determinism Supporting ISO 26262 certification through deterministic architecture requires attention to several design elements. ### Deterministic Scheduling The scheduler must make consistent decisions based on defined priority schemes rather than incorporating timing-dependent or random elements. Fixed-priority scheduling with well-defined preemption rules provides predictable behavior that can be analyzed and verified. ### Elimination of Hidden State Sources of nondeterminism such as uninitialized variables, timing-dependent race conditions, or hardware-dependent cache behavior must be addressed at the architectural level. The system should expose all relevant state explicitly so that verification tools can reason about system behavior comprehensively. ### Bounded Execution Paths All system operations should have defined completion bounds. Unbounded loops, unlimited retry mechanisms, or timing-dependent polling introduce variability that complicates verification. Deterministic architectures typically enforce resource bounds that make all execution paths analyzable. ### Formal Verification Support For ASIL D, ISO 26262 highly recommends semi-formal methods. Deterministic systems are more amenable to formal verification techniques because their behavior can be modeled completely. The absence of nondeterministic elements means that model checking and theorem proving tools can provide stronger guarantees about system properties. ## Integration with Safety Mechanisms Deterministic architecture is designed to complement rather than replace traditional safety mechanisms required by ISO 26262. The standard defines requirements for hardware and software safety mechanisms at the system level. Deterministic execution can enhance these mechanisms. For example, watchdog timers become more reliable when system timing is predictable. 
Redundancy and voting schemes produce more meaningful results when comparing outputs from deterministically-executed functions. Diagnostic routines can detect anomalies more accurately when normal behavior is consistent. The [Murray Deterministic Computing Platform (MDCP)](/mdcp/) demonstrates this integration approach by providing deterministic kernels that can host safety-critical functions while maintaining compatibility with ISO 26262 verification requirements. Similarly, [CardioCore](/cardiocore/) shows how deterministic principles apply to medical device certification under related standards like IEC 62304. ## Real-World Application Patterns Organizations pursuing ASIL D certification while leveraging deterministic architecture typically follow certain patterns: ### Incremental Adoption Rather than requiring full system determinism, teams often start by making safety-critical functions deterministic while accepting nondeterministic behavior in non-safety-critical components. This approach allows them to gain certification benefits where they matter most while managing development complexity. ### Evidence-First Development With deterministic execution enabling precise evidence generation, development processes can emphasize evidence collection from the start. Every test run produces reproducible evidence that can directly support certification documentation. ### Continuous Verification The reproducibility properties of deterministic systems support continuous integration and verification. Automated test suites can run with confidence that passing results represent genuine safety requirement satisfaction rather than fortunate scheduling outcomes. ## Limitations and Trade-offs Deterministic architecture is designed to support certification, not guarantee it. Several considerations remain: The standard evaluates entire systems, not just computation platforms. Even with deterministic execution, factors such as sensor quality, actuator response characteristics, hardware reliability, and system integration all influence certification outcomes. Determinism adds design constraints. Systems must be architected to avoid nondeterministic patterns, which may require different approaches than conventional embedded development. This can affect development timelines and team training requirements. ISO 26262 uses a risk-based approach where assessors evaluate whether implemented measures adequately address identified hazards. Assessor interpretation plays a significant role, and different certification bodies may have varying perspectives on the value of deterministic architecture in meeting standard requirements. ## Strategic Perspective For organizations developing ASIL D automotive systems, deterministic architecture represents a design choice that can address specific certification challenges. It does not eliminate the need for rigorous safety engineering, comprehensive hazard analysis, or robust testing practices. Rather, it provides architectural support for these activities by making system behavior more predictable and verifiable. The decision to pursue deterministic architecture should be based on system requirements, development constraints, and certification strategy. Organizations should evaluate whether the verification benefits and evidence quality improvements align with their specific safety goals and regulatory context. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. 
Teams should work with qualified functional safety consultants to determine whether deterministic architecture supports their certification objectives. --- ## The Hidden Cost of 'Consistent Environments': What Docker Actually Guarantees **URL**: https://speytech.com/insights/docker-hidden-cost/ **Published**: January 12, 2026 18:30 **Topic**: Systems Architecture Why shipping 450MB to avoid inconsistency may be solving the wrong problem Update (January 17, 2026): Based on feedback from Jan Tymiński and others, several clarifications have been added to distinguish deployment consistency from execution determinism, and to better acknowledge Docker's orchestration benefits. The core thesis remains unchanged. Docker solved a real problem. Before containers, deploying software meant hoping the production environment matched development. Library versions drifted. System configurations diverged. "Works on my machine" became a punchline because it happened constantly. Containers fixed this by shipping everything: application, dependencies, libraries, and a slice of the operating system. If the container works in testing, it works in production, because production receives the same bytes. This is a genuine engineering achievement. It has made deployment dramatically more reliable for millions of applications. But it's worth examining what Docker actually guarantees—and what it doesn't—because the answer reveals a hidden cost that many teams never consider. ## What Docker Guarantees Docker guarantees filesystem consistency. When you build an image, you capture a specific state: these files, with these contents, at these paths. When you run the image, you get that filesystem. The application sees the same libraries, the same configuration files, the same directory structure, regardless of the host. This eliminates an entire class of deployment failures. No more "missing libssl" errors. No more "wrong Python version" surprises. No more configuration drift between staging and production. Docker also provides process isolation. Your container gets its own process namespace, network namespace, and (optionally) resource limits. Other processes on the host cannot interfere with your files or memory. These guarantees are valuable. They are also the full extent of what containers provide at the deployment layer. ## What Docker Does Not Guarantee Docker does not guarantee deterministic execution. The same container, given the same inputs, may produce different outputs on different runs. This is not a bug in Docker—it's a fundamental limitation of what filesystem consistency can achieve. **Important distinction:** Docker solves *deployment* consistency (the application starts in the same state across environments). It does not solve *execution* determinism (the application executes identically given the same inputs). These are different problems at different layers of the stack. A static binary also doesn't guarantee execution determinism unless the code itself is written deterministically. The difference is that Docker solves deployment problems, while deterministic code design solves execution problems. Consider a container running a multi-threaded application. The threads are scheduled by the host kernel, not by Docker. The order of thread execution depends on CPU load, interrupt timing, and scheduler decisions that vary from moment to moment. Two runs of the same container, with the same inputs, may interleave threads differently and produce different results. 
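To make the threading point concrete, here is a minimal sketch in plain C with POSIX threads; it is not from any of the projects discussed, nothing in it is container-specific, and the iteration count is arbitrary. Run it twice, inside or outside a container, and the printed total will typically differ.

```c
/* Minimal sketch: same binary, same input, different output run-to-run. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;              /* shared, deliberately unsynchronised */

static void* bump(void* arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        counter++;                    /* non-atomic read-modify-write race */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Expected 200000, but the interleaving is decided by the host kernel,
       so the result varies between runs regardless of the filesystem. */
    printf("%ld\n", counter);
    return 0;
}
```

The image is byte-identical on every run; the schedule the host kernel chooses is not.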
Consider a container that calls `gettimeofday()` or reads `/dev/urandom`. The values returned depend on the host system's state at the moment of the call. The container filesystem is identical; the execution is not. Consider a container that allocates memory with `malloc()`. The addresses returned depend on the host's memory layout, ASLR settings, and prior allocations. Code that accidentally depends on pointer ordering will behave differently across runs. Docker guarantees that your code will *start* in the same state. It does not guarantee that your code will *execute* the same way or *finish* in the same state. ## The 450MB Question A typical Docker image for a modest application weighs hundreds of megabytes. A Python web service might ship 450MB: the application code, Python runtime, system libraries, and a Debian or Alpine base image. Of those 450MB, how much is actually *your* code? Often, less than one percent. The application itself might be a few hundred kilobytes. The rest is infrastructure: someone else's code, someone else's configuration, someone else's decisions about how to structure a Linux filesystem. This inversion—shipping 16,000 times more code than you wrote—is the hidden cost of the Docker approach. You're not shipping your application. You're shipping an entire operating system environment, frozen at a point in time, because your application cannot be trusted to run consistently without it. This comparison is deliberately rhetorical to highlight the cost difference. In practice: - Optimized containers (multi-stage builds, Alpine/scratch base) can be 10-50MB - Static binaries with realistic dependencies can be 5-20MB - The ratio changes, but the principle remains: Docker packages environment, static binaries eliminate environmental dependencies Both approaches have contexts where they're optimal. The point is not "Docker is bad" but "understand what you're trading." The question is: why can't your application run without this packaging? ## Inconsistency Has a Source Software behaves inconsistently across environments for specific, identifiable reasons: | Source | Docker Helps? | Why? | |--------|---------------|------| | Dynamic linking | ✅ | Ships consistent library versions | | Configuration files | ✅ | Ships consistent config files | | System calls (time, random) | ❌ | Not a deployment problem - needs code changes | | Non-deterministic algorithms | ❌ | Not a deployment problem - needs code changes | | Thread scheduling | ❌ | Kernel-level, outside Docker's scope | **Dynamic linking.** Your code calls functions in shared libraries. Different library versions implement those functions differently. The fix: ship the library (Docker's approach) or eliminate the dependency (static linking). **Configuration files.** Your code reads settings from files that differ between environments. The fix: ship the files (Docker's approach) or embed configuration in the binary. **System calls with environmental dependencies.** Your code asks the operating system for information—time, randomness, process IDs—that varies by host. The fix: there is no filesystem fix. Docker cannot help here—this requires code-level changes. **Non-deterministic algorithms.** Your code uses hash tables with random iteration order, threading with undefined interleaving, or memory allocation with unpredictable addresses. The fix: there is no container fix. The non-determinism is in your code. Docker addresses the first two sources by brute force: ship everything, and the filesystem becomes consistent. 
But it cannot address the latter three, because they are not filesystem problems. They are execution problems that require different solutions. ## The Deterministic Alternative What if, instead of shipping an environment that tolerates inconsistent code, you wrote code that executes consistently? The [c-from-scratch](https://github.com/williamofai/c-from-scratch) project demonstrates this approach. The Baseline module—a statistical anomaly detector—compiles to a 27KB static binary. It has no dynamic library dependencies. It reads no configuration files. It makes no system calls except through a narrow POSIX interface (read, write, basic file operations). This binary produces identical output on: - A developer's MacBook - A Raspberry Pi running Raspbian - An x86 server running RHEL - A CI runner in GitHub Actions - A Docker container (if you really want one) The consistency does not come from shipping a consistent environment. It comes from writing code that does not depend on environmental details. The POSIX interface is the same everywhere. The binary carries its own logic. Given the same input file, it produces the same output file, byte for byte, regardless of where it runs. ## What Determinism Requires Writing deterministic code is not free. It requires discipline in several areas: **Static linking.** Dependencies are compiled into the binary, not resolved at runtime. This increases binary size (though rarely to container scale) and requires recompilation for security updates. The tradeoff: you know exactly what code will execute. **No environmental queries.** The code does not ask "what time is it?" or "give me a random number" unless those values are provided as explicit inputs. This makes testing easier—inputs are controlled—but requires architectural changes if your current design assumes ambient access to time or randomness. **Deterministic algorithms.** Hash tables use stable iteration order. Floating-point operations are ordered consistently. Threading, if present, uses explicit synchronisation that produces the same interleaving on every run. This is the hardest requirement—it touches algorithmic choices throughout the codebase. **Narrow system interface.** The code interacts with the operating system through a minimal, well-defined interface. POSIX provides this: open, read, write, close. The less you ask of the OS, the less can vary between hosts. These requirements are achievable. They are not even unusual in certain domains. Safety-critical systems—avionics, medical devices, automotive controllers—routinely meet them because certification demands reproducible behaviour. The techniques exist; they're simply not applied to most software because most software doesn't need them. ## When Containers Make Sense To be clear: Docker is not wrong. Containers are the right solution for many, perhaps most, deployment scenarios. If your application genuinely requires a complex runtime—Python with specific package versions, Node.js with native extensions, Java with particular JVM flags—containers package that complexity efficiently. The alternative (documenting and reproducing the environment manually) is error-prone and tedious. If your deployment target is heterogeneous—some hosts run Ubuntu, others CentOS, others Amazon Linux—containers provide a common abstraction. Your application sees the same filesystem regardless of the underlying host distribution. 
If your team lacks control over the deployment environment—you're shipping to customer data centres, or running in a managed platform like Kubernetes—containers are the interface that platform expects. If your application's non-determinism is acceptable—most web services don't need bit-identical reproducibility—then Docker's guarantees are sufficient, and its costs are reasonable. Additionally, container orchestration platforms (Kubernetes, ECS, Cloud Run) provide operational benefits that go beyond deployment consistency: - Automatic scaling and load balancing - Rolling updates and rollbacks - Health checks and self-healing - Service discovery and networking - Resource isolation and multi-tenancy These operational benefits often justify containers even for applications that could run as static binaries. The infrastructure value exceeds the deployment value. ## When Containers Are Overhead But there are cases where containers add cost without adding value: **Deterministic applications.** If your code already produces identical results across environments, the container is packaging megabytes of redundant insurance. A static binary does not need an environmental safety blanket. **Resource-constrained deployments.** Edge devices, embedded systems, and IoT devices often lack the storage and memory for container runtimes. A static binary runs directly; a container requires infrastructure. **Certification-required systems.** Safety-critical domains require evidence of exactly what code executes. A container image contains thousands of packages, each a potential audit target. A static binary contains your code and nothing else. The certification surface area differs by orders of magnitude. **High-frequency deployments.** If you deploy hundreds of times per day, container image transfer becomes a bottleneck. A small binary transfers in milliseconds; a large image takes minutes over slow links. **Debugging production issues.** When a container behaves differently in production than in testing—despite identical images—the problem is non-determinism in your code or environment. The container obscured this; it did not prevent it. ## The Real Question Key Insight The debate is not "containers versus no containers." It's "where does consistency come from?" Docker's answer: consistency comes from packaging. Ship the environment, and the environment will be consistent. The deterministic answer: consistency comes from code. Write code that does not depend on environmental details, and the environment becomes irrelevant. Both approaches work. Both have costs. Docker's cost is visible in image sizes, transfer times, and registry bills. Determinism's cost is invisible in engineering discipline, architectural constraints, and upfront design effort. The question is which cost you'd rather pay—and whether you're paying Docker's cost because you need to, or because you never considered the alternative. ## Shipping a Bet A colleague once described Docker images as "shipping a bet that nothing upstream breaks." The image captures a moment: these library versions work together today. But libraries are maintained by other people, with other priorities, on other timelines. When a CVE arrives for a transitive dependency four layers deep in your container, you rebuild the image—and pray that the new versions still work together. Usually they do. Sometimes they don't, and you spend days debugging incompatibilities that your code did not introduce. The deterministic approach does not eliminate this risk entirely. 
Static binaries still depend on some system calls, and operating systems still receive updates. But the surface area is radically smaller. You depend on POSIX semantics, not on the internal implementation of libfoo version 2.3.4-ubuntu7. Smaller bets are easier to win. ## Conclusion Docker solved deployment inconsistency through containment: wrap your application in an environment, and ship the whole thing. This works. It has transformed how software is deployed, making reliable delivery accessible to teams that previously struggled with environmental chaos. But containment is not the only solution. Elimination is another. If your code does not depend on environmental details—if the same binary produces the same output on any POSIX system—then the environment does not need to be shipped. The consistency is in the code, not the container. It's worth noting that containers and deterministic code are not mutually exclusive. You can ship a deterministically-designed application in a container to get both deployment consistency AND execution determinism. The question is whether you need the container layer for your specific use case, or whether simpler deployment (static binary, systemd service, etc.) suffices. The 450MB container and the 27KB binary both represent answers to the question "how do we deploy reliably?" One ships everything to avoid thinking about dependencies. The other thinks carefully about dependencies to avoid shipping everything. Neither is universally correct. But if you've only considered the first approach, it may be worth examining the second. The discipline required to write deterministic code pays dividends beyond deployment: testability, debuggability, reproducibility, and certifiability. The hidden cost of consistent environments is the assumption that your code requires them. Sometimes it does. Sometimes—with care and discipline—it doesn't. As with any architectural approach, suitability depends on system requirements, deployment constraints, and team capabilities. But the next time you wait for a large image to transfer, it's worth asking: what, exactly, am I shipping? --- *Thanks to [Jan Tymiński](https://www.linkedin.com/in/jantyminski/) for thoughtful feedback that improved the clarity of this article's distinction between deployment and execution layers.* --- ## The Zero-Variance Problem: When Mathematics Meets Machines **URL**: https://speytech.com/insights/zero-variance-problem/ **Published**: January 11, 2026 22:50 **Topic**: Formal Methods Why perfect consistency creates impossible calculations—and how to handle it Note: This article discusses numerical edge cases in statistical monitoring. Specific implementations require validation appropriate to their deployment context. Every programmer learns to check for division by zero. It's the canonical example of defensive coding—the operation that crashes programs, corrupts calculations, and appears in every introductory textbook. Yet in statistical monitoring systems, this simple edge case manifests in a form that is both mathematically subtle and practically dangerous: the zero-variance problem. The issue arises in anomaly detection. To determine whether an observation is unusual, we often compute how many standard deviations it lies from the mean—the z-score. The formula is elegant: z = (x − μ) / σ. But when variance is zero, standard deviation σ is also zero, and the formula demands division by nothing. This is not a theoretical curiosity. 
It occurs in real systems, under realistic conditions, with consequences that range from crashed processes to silent failures to incorrect safety decisions. ## When Variance Vanishes Zero variance sounds impossible. Surely real-world data has some variability? But the conditions that produce it are surprisingly common. **Identical observations.** A system that heartbeats every 100 milliseconds with perfect regularity will produce a sequence: 100, 100, 100, 100, 100. The mean is 100. The variance is 0. When the first anomalous value arrives—say, 150 milliseconds—the z-score calculation divides by zero. **Insufficient data.** With a single observation, variance is undefined (or zero, depending on the formula). A monitoring system that has seen only one heartbeat cannot compute meaningful statistics. Yet the second heartbeat will arrive, and the system must respond. **Numerical underflow.** Even when true variance is positive, floating-point arithmetic can produce zero. If observations are very close together relative to machine precision, accumulated rounding errors can collapse variance to exactly 0.0. The mathematics says variance is tiny; the computer says it's nothing. **Clamped or quantised inputs.** Sensors with limited resolution, systems that round to integers, or processes that saturate at limits can produce streams of identical values. A temperature sensor stuck at its maximum reading reports the same value indefinitely. Variance: zero. Each of these scenarios is realistic. Each produces the same computational crisis: a formula that mathematics defines perfectly, but machines cannot evaluate. ## Why It Matters for Safety In ordinary software, division by zero produces an exception, a crash, or a special value like infinity or NaN. The failure is visible. Someone notices. In safety-critical monitoring, the failure modes are worse. Consider an anomaly detector protecting a cardiac rhythm monitor. The detector computes z-scores on inter-beat intervals. During a period of perfect regularity—a healthy, stable rhythm—variance approaches zero. Then an arrhythmia occurs. The interval changes dramatically. The detector should flag this as a severe anomaly. Instead, it divides by zero. What happens next depends on the implementation: - **Crash.** The monitoring process terminates. The patient is unprotected until someone restarts it. - **NaN propagation.** The z-score becomes NaN, which propagates through subsequent calculations. Comparisons with NaN return false. The anomaly is never detected. - **Infinity.** The z-score becomes positive infinity. If the threshold check uses greater-than comparison, infinity exceeds any threshold, triggering a perpetual alarm. If it uses floating-point equality, infinity may not match expected values, causing silent failure. - **Undefined behaviour.** In languages like C, integer division by zero is undefined. The program may do anything—continue with garbage values, crash, or corrupt memory. None of these outcomes is acceptable. The zero-variance problem transforms a period of perfect health—stable, regular operation—into a vulnerability that manifests precisely when anomalies occur. ## The Mathematical Reality The z-score formula assumes a population with positive variance. When variance is zero, every observation equals the mean. In this degenerate case, the concept of "standard deviations from the mean" is meaningless—there is no scale against which to measure deviation. This is not a flaw in the mathematics. 
It is a boundary condition where the statistical model does not apply. The formula z = (x − μ) / σ is valid when σ > 0. When σ = 0, we are outside the domain of the function. Mathematicians handle this by stating preconditions: "Let σ > 0." Programmers do not have this luxury. The input arrives regardless of whether it satisfies preconditions. The code must respond.

## Three Inadequate Solutions

Before examining what works, consider what doesn't.

**Assume it won't happen.** This is the most common approach, and the most dangerous. Developers reason that real data always has some variance, so the edge case is theoretical. This assumption holds until it doesn't—typically in production, under load, with consequences.

**Add a small epsilon.** Replace σ with σ + ε to avoid exact zero. This creates a different problem: the choice of ε is arbitrary, and small ε values produce enormous z-scores from tiny deviations. A system might flag a 101-millisecond heartbeat (after a run of 100s) as a catastrophic anomaly because (101 - 100) / 0.0001 = 10,000 standard deviations.

**Clamp the z-score.** Compute the z-score normally, then clamp results to some maximum value. This hides the symptom without addressing the cause. The z-score of 10,000 becomes 10 (or whatever the clamp), but the underlying numerical instability remains. Different observations produce the same clamped output, losing information.

Each of these approaches either ignores the problem or patches it without understanding. They lead to systems that appear to work but fail under specific, reproducible conditions.

## A Principled Solution: Epistemic Honesty

The correct approach is epistemic honesty: acknowledge when the calculation is not meaningful, and represent that explicitly in the output.

Design Property: Guarded Computation
A statistical calculation should return its result only when the computation is valid. When preconditions are not met, it should return a distinct status indicating insufficient information.

In practice, this means the z-score function returns two things: a validity flag and a value. The value is meaningful only when the flag indicates success.

```c
typedef struct {
    bool valid;
    double z;
} zscore_result_t;

zscore_result_t compute_zscore(double x, double mu, double variance) {
    zscore_result_t result;
    if (variance <= EPSILON) {
        result.valid = false;   /* no statistical basis for a z-score */
        result.z = 0.0;
        return result;
    }
    result.valid = true;
    result.z = (x - mu) / sqrt(variance);
    return result;
}
```

The state machine that consumes this result must handle the invalid case explicitly. For the Baseline detector, with states LEARNING, STABLE, and DEVIATION, the transitions are:

```
LEARNING:
- variance > threshold → STABLE (if z-score normal)
- variance > threshold → DEVIATION (if z-score abnormal)
- variance ≤ threshold → LEARNING (stay, insufficient data)

STABLE:
- z-score exceeds threshold → DEVIATION
- variance drops to zero → LEARNING (revert, lost statistical basis)

DEVIATION:
- z-score returns to normal → STABLE
- variance drops to zero → LEARNING (revert)
```

The transition "variance drops to zero → LEARNING" handles the case where a previously variable signal becomes constant. The system does not crash or produce garbage; it acknowledges that it has lost the ability to make statistical judgments and reverts to learning mode.

This is a total function in the technical sense: every input has a defined response. The zero-variance case is not an exception to be caught but a state to be handled.

## The EPSILON Question

The guard condition uses `variance <= EPSILON` rather than an exact comparison with zero. Exact zero is the wrong test: floating-point underflow can leave a variance that is technically non-zero yet far too small to divide by safely. EPSILON should sit below any variance that carries statistical meaning for the signal being monitored, but above the noise floor of the arithmetic. The resulting behaviour at the boundary can be stated as invariants:

```
INV-ZV1: compute_zscore never performs a division when variance <= EPSILON
INV-ZV2: (result.valid == true) → (variance > EPSILON)
INV-ZV3: (variance drops below EPSILON) → (next_state == LEARNING)
```

These invariants can be verified through contract tests and fuzz testing.
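A minimal sketch of what such a contract test can look like, reusing the guarded `compute_zscore` shown above; the `EPSILON` value and the test harness here are illustrative, not the c-from-scratch implementation:

```c
/* Illustrative contract test for the zero-variance guard; EPSILON and the
 * helper below are local to this sketch, not the c-from-scratch code. */
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

#define EPSILON 1e-9

typedef struct { bool valid; double z; } zscore_result_t;

static zscore_result_t compute_zscore(double x, double mu, double variance) {
    zscore_result_t result;
    if (variance <= EPSILON) {        /* INV-ZV1: no division without a basis */
        result.valid = false;
        result.z = 0.0;
        return result;
    }
    result.valid = true;
    result.z = (x - mu) / sqrt(variance);
    return result;
}

int main(void) {
    /* Zero variance: result must be flagged invalid, never NaN or infinity. */
    zscore_result_t r = compute_zscore(150.0, 100.0, 0.0);
    assert(!r.valid);

    /* Normal case: result must be valid and finite. */
    r = compute_zscore(150.0, 100.0, 25.0);
    assert(r.valid && isfinite(r.z));  /* z = (150 - 100) / 5 = 10 */

    printf("zero-variance contract holds\n");
    return 0;
}
```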
The c-from-scratch Baseline module includes tests that specifically target the zero-variance boundary: - Sequences of identical values - Sequences where variance decreases to zero over time - Sequences that alternate between zero and non-zero variance - Boundary cases at exactly EPSILON Fuzz testing with random sequences confirms that the invariants hold regardless of input patterns. The system never divides by zero, never produces NaN, and never crashes—because the edge case is handled by design, not by accident. ## Beyond Z-Scores The zero-variance problem is an instance of a broader pattern: mathematical formulas that are undefined at boundary conditions. Other examples in monitoring systems include: - **Logarithms of zero or negative values** — log-based metrics fail when the input is non-positive - **Ratios with zero denominators** — efficiency metrics, rates, and percentages all divide - **Inverse operations at singularities** — matrix inversion fails for singular matrices The solution pattern is the same: guard the computation, return validity information, and design the state machine to handle the "insufficient data" case explicitly. This is not defensive programming in the sense of catching errors; it is honest programming in the sense of representing what the computation can and cannot tell us. ## Conclusion The zero-variance problem illustrates a general truth: mathematics describes ideal relationships, while machines compute with finite precision under real-world constraints. The gap between mathematical elegance and computational reality is where bugs live. The z-score formula z = (x − μ) / σ is mathematically perfect. It is also computationally incomplete—it does not specify behaviour when σ = 0. That specification is the programmer's responsibility. The principled solution is not to avoid the edge case or to patch it with arbitrary constants. It is to acknowledge it explicitly: when variance is insufficient, the z-score cannot be computed, and this fact should be represented in the output. State machines that consume this output should have explicit states for "insufficient information" and transitions that handle the boundary. This approach aligns with the broader philosophy of the [c-from-scratch](https://github.com/williamofai/c-from-scratch) project: make the mathematics explicit, handle all cases, and let the code reflect what we actually know. When we know nothing—when variance is zero—the code should say so. As with any architectural approach, suitability depends on system requirements, risk classification, and numerical context. But for safety-critical monitoring where silent failures are unacceptable, epistemic honesty is not optional. The zero-variance problem is not an edge case to be dismissed; it is a boundary condition to be respected. --- ## Composition Without Compromise: Connecting FSMs **URL**: https://speytech.com/insights/fsm-composition/ **Published**: January 11, 2026 22:22 **Topic**: Formal Methods How verified components combine to create verified systems Note: This article discusses architectural patterns for composing finite state machines. While these principles inform safety-critical design, specific implementations require verification appropriate to their regulatory context. The promise of modular software is straightforward: build small, verified components, then combine them into larger systems. The reality is often disappointing. Components that work correctly in isolation interact in unexpected ways. 
Properties that held individually disappear when modules connect. The system becomes less than the sum of its parts. This failure mode is not inevitable. When components are designed with composition in mind—when their properties are structural rather than incidental—the resulting systems can inherit guarantees from their constituent parts. The key lies in understanding what makes composition safe. ## The Composition Problem Consider two well-tested modules: a heartbeat detector that tracks whether a system is alive, and a statistical baseline that monitors for timing anomalies. Each works correctly in isolation. Each has passing tests and verified invariants. Now connect them. The heartbeat detector outputs inter-arrival times; the baseline detector consumes them. What properties does the combination have? The honest answer, for most software, is "unclear." The modules were designed independently. Their interaction creates a new system with emergent behaviour that neither module anticipated. Edge cases multiply. Timing dependencies appear. The composition may work, but you cannot know without testing every possible interaction—a combinatorial impossibility. This uncertainty is the composition problem: connecting verified components does not automatically yield a verified system. ## Why Most Composition Fails The composition problem has deep roots. Most software modules are designed for convenience rather than composability. They carry implicit assumptions about their environment: that inputs will arrive in certain orders, that global state will be initialised, that timing constraints will be met. These assumptions work when a human orchestrates the modules carefully. They fail when modules are combined mechanically. Consider a module that caches its last result for efficiency. In isolation, this optimisation is invisible—the module returns correct outputs. But when composed with another module that expects fresh computation on every call, the cached values create subtle errors. The caching assumption (that results can be reused) conflicts with the freshness assumption (that each call triggers new computation). Neither module is wrong; their assumptions are simply incompatible. More formally, the problem is that modules have hidden state and hidden dependencies. They reach outside their defined interfaces to touch shared resources, timing mechanisms, or implicit contracts. These hidden channels create coupling that composition cannot see and therefore cannot verify. ## Three Properties That Enable Safe Composition The [c-from-scratch](https://github.com/williamofai/c-from-scratch) project takes a different approach. Every module is designed from the start with three properties that, together, enable composition without compromise. Design Property: Closure A module is closed if it depends only on its explicit inputs and internal state. No global variables. No hidden channels. No ambient dependencies. Closure eliminates hidden coupling. A closed module can be reasoned about in isolation because there is nothing outside its interface that affects its behaviour. What you pass in, and what it contains, is all there is. Design Property: Totality A module is total if it handles every possible input. No undefined behaviour. No assumptions about what inputs are "valid." Every state and every input has a defined response. Totality eliminates edge-case failures. A total module cannot be surprised by unexpected inputs because it has a response to everything. 
When modules are composed, the output of one becomes the input of another—and that input is handled, regardless of what it contains. Design Property: Determinism A module is deterministic if the same state and input always produce the same next state and output. No randomness. No timing dependencies. No non-determinism. Determinism eliminates interaction surprises. When you compose deterministic modules, the composite is also deterministic. The same sequence of inputs will always produce the same sequence of outputs, making the system testable and reproducible. These three properties are not optional features. They are the mathematical requirements for composition that preserves guarantees. ## A Concrete Example: The Timing Health Monitor The c-from-scratch course builds toward a composed system: the Timing Health Monitor. It combines two FSMs—Pulse (for heartbeat detection) and Baseline (for anomaly detection)—into a single system that answers: "Is this process alive and behaving normally?" ### The Pulse Component The Pulse FSM tracks liveness using three states (as explored in [Heartbeats and State Machines](/insights/heartbeats-state-machines/)): ``` Pulse States: { UNKNOWN, ALIVE, DEAD } Pulse Inputs: { BEAT, TIMEOUT } ``` When a heartbeat arrives, Pulse transitions toward ALIVE. When too much time passes without a beat, Pulse transitions toward DEAD. The module outputs inter-arrival times (Δt) between consecutive heartbeats—this becomes the input to the next module. ### The Baseline Component The Baseline FSM monitors those inter-arrival times using exponential moving averages (see [Statistics as State Transitions](/insights/statistics-as-state-transitions/)): ``` Baseline States: { LEARNING, STABLE, DEVIATION } Baseline Inputs: { Δt (timing value) } ``` During the LEARNING phase, Baseline accumulates statistics. Once sufficient data exists, it transitions to STABLE. If subsequent values deviate significantly from the established baseline, it transitions to DEVIATION. ### The Composition The composed Timing Health Monitor has four states: ``` Timing States: { INITIALIZING, HEALTHY, UNHEALTHY, DEAD } ``` The mapping from component states to composed state is a pure function: | Pulse | Baseline | → Timing | |-------|----------|----------| | DEAD | (any) | DEAD | | UNKNOWN | (any) | INITIALIZING | | ALIVE | LEARNING | INITIALIZING | | ALIVE | STABLE | HEALTHY | | ALIVE | DEVIATION | UNHEALTHY | This mapping has important properties. DEAD is absorbing—once the heartbeat is lost, nothing else matters. HEALTHY requires both existence evidence (ALIVE) and statistical evidence (STABLE). The composed state reflects a conjunction of component knowledge. ## Why This Composition Works The composition works because both Pulse and Baseline satisfy closure, totality, and determinism. Let's trace why. **Closure is preserved** because the composed system's state is simply the pair of component states. There is no additional hidden state, no global variables, no ambient dependencies. The composition function (the mapping table) is itself closed—it depends only on its inputs. **Totality is preserved** because the mapping table is complete. Every possible combination of Pulse state and Baseline state has a defined Timing state. When Pulse produces an unexpected state, the composition handles it. When Baseline produces an unexpected state, the composition handles it. The domain is the Cartesian product of component domains, and every element of that product has a defined output. 
**Determinism is preserved** because the composition function is deterministic and both components are deterministic. Given the same component states, the mapping always produces the same composed state. Given the same inputs to each component, they always produce the same next states. Therefore, given the same initial state and input sequence, the composed system always produces the same final state.

This is not an accident. It is the direct consequence of designing components with these properties from the start.

## The Composition Function

At each step, the composed system executes a straightforward algorithm:

```
timing_step(state, event, timestamp):
  1. Compute Δt if we have a previous heartbeat
  2. Feed event/timestamp to Pulse
     → Pulse updates its state
     → Pulse produces ALIVE/DEAD/UNKNOWN
  3. If Δt is available, feed to Baseline
     → Baseline updates its statistics
     → Baseline produces LEARNING/STABLE/DEVIATION
  4. Map component states to composed state
     → Apply the mapping table
     → Return INITIALIZING/HEALTHY/UNHEALTHY/DEAD
```

The order matters: data flows from event through Pulse (which produces Δt) through Baseline (which consumes Δt) to the final state. This order is determined by the data dependencies, not arbitrary convention.

## What Composition Buys You

When composition preserves properties, several engineering benefits follow.

**Component-level verification transfers to the system.** The invariants tested on Pulse and Baseline individually continue to hold in the composed system. If Pulse has an invariant "DEAD state is entered only after ALIVE was observed," that invariant holds in the composition. You verify components once; the verification extends.

**Debugging becomes tractable.** When the composed system misbehaves, you can trace the failure to a specific component. Was the timing anomaly due to Pulse misclassifying a timeout, or Baseline miscalculating the z-score? The closed boundaries between components let you isolate faults.

**System behaviour is reproducible.** Given an initial state and an input sequence, you can replay the exact behaviour. This supports post-incident analysis, regression testing, and certification evidence. See [Cryptographic Execution Tracing](/insights/cryptographic-proof-execution/) for how this property enables audit trails.

**Complexity remains manageable.** The composed system has 4 states, not 3 × 3 = 9. The mapping table deliberately collapses the product space into meaningful distinctions. You reason about HEALTHY or UNHEALTHY, not about the nine possible state combinations.

## The Algebra of Composition

There is a deeper structure here. Closed, total, deterministic FSMs form a category in the mathematical sense. The objects are FSMs; the morphisms are well-typed connections between them. Composition is associative: (A ∘ B) ∘ C = A ∘ (B ∘ C). There is an identity FSM that passes inputs unchanged.

This algebraic structure means you can compose arbitrarily. Build Pulse and Baseline into Timing. Compose Timing with another FSM that handles recovery. Compose that with an FSM that manages logging. At each stage, the properties are preserved, and the composed system remains closed, total, and deterministic.

This is why the title says "without compromise." You do not sacrifice verification for modularity or modularity for verification. The properties are additive, not contradictory.
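To make the mapping concrete, here is a hedged sketch of the composition table as a pure, total function in C; the enum and function names are invented for this example rather than taken from the course code:

```c
/* Illustrative sketch of the mapping table as a pure, total function.
 * Names (timing_map and the enums) are invented for this example. */
#include <stdio.h>

typedef enum { PULSE_UNKNOWN, PULSE_ALIVE, PULSE_DEAD } pulse_state_t;
typedef enum { BASE_LEARNING, BASE_STABLE, BASE_DEVIATION } baseline_state_t;
typedef enum { TIMING_INITIALIZING, TIMING_HEALTHY,
               TIMING_UNHEALTHY, TIMING_DEAD } timing_state_t;

/* Closed: depends only on its arguments. Total: every (pulse, baseline)
 * pair has a defined result. Deterministic: no hidden state, no I/O. */
static timing_state_t timing_map(pulse_state_t p, baseline_state_t b) {
    if (p == PULSE_DEAD)    return TIMING_DEAD;          /* absorbing */
    if (p == PULSE_UNKNOWN) return TIMING_INITIALIZING;
    switch (b) {                                         /* p == PULSE_ALIVE */
        case BASE_STABLE:    return TIMING_HEALTHY;
        case BASE_DEVIATION: return TIMING_UNHEALTHY;
        case BASE_LEARNING:
        default:             return TIMING_INITIALIZING;
    }
}

int main(void) {
    /* ALIVE + STABLE is the only combination that reports HEALTHY. */
    printf("%s\n", timing_map(PULSE_ALIVE, BASE_STABLE) == TIMING_HEALTHY
                       ? "HEALTHY" : "not healthy");
    return 0;
}
```

The switch collapses the nine-element product space into the four composed states, exactly as the table above does.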
## Practical Implementation The c-from-scratch course implements this composition in pure C, following the [Math → Structs → Code methodology](/insights/statistics-as-state-transitions/). The composed structure contains the component structures: ```c typedef struct { pulse_t pulse; baseline_t baseline; timing_state_t state; } timing_t; ``` The step function implements the algorithm described above, with explicit error handling for edge cases like division by zero (see [The Zero-Variance Problem](/insights/zero-variance-problem/) for why this matters). Every function maintains the three properties: - All inputs come through parameters; all outputs go through returns or output pointers - Every input combination is handled, including invalid inputs - Given the same state and input, the same output is produced Contract tests verify these properties, not just happy-path behaviour. Fuzz testing with random input sequences confirms the invariants hold under conditions the developer did not imagine. ## Implications for Safety-Critical Systems This compositional approach aligns with regulatory requirements for safety-critical software. Standards like [IEC 62304](/insights/iec-62304-class-c/) and DO-178C emphasise modular design, traceability, and verification evidence. A system composed of verified components, with a verified composition, provides exactly the evidence these standards require. The [CardioCore](/cardiocore/) kernel applies these principles to cardiac rhythm monitoring. The [MDCP](/mdcp/) platform extends them to multi-core determinism. In both cases, the mathematical foundation—closed, total, deterministic FSMs that compose safely—is what enables certification-ready designs. ## Conclusion The composition problem is real: connecting verified components does not automatically yield a verified system. But it is solvable, when components are designed with composition in mind. The key is structural: modules must be closed (no hidden dependencies), total (handle all inputs), and deterministic (same inputs produce same outputs). When components satisfy these properties, their composition inherits them. Verification extends from parts to whole. This is not a theoretical nicety. It is a practical requirement for systems that must be trusted. When a pacemaker monitors cardiac rhythm, or an autonomous vehicle tracks its environment, or a satellite maintains its attitude, the composed system must work correctly—not just its individual pieces. The [c-from-scratch](https://github.com/williamofai/c-from-scratch) project teaches this approach from first principles: define the mathematics, transcribe to data structures, implement the step function, verify with contracts. The Timing Health Monitor is a concrete example of composition done right. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. But for systems where correctness matters—where "probably works" is insufficient—composition without compromise is not optional. It is the engineering discipline that makes verification tractable. --- ## Why Your ML Model Gives Different Results Every Tuesday **URL**: https://speytech.com/insights/ml-nondeterminism-problem/ **Published**: January 09, 2026 21:20 **Topic**: Deterministic Computing The hidden sources of non-determinism in machine learning, and why they matter more than you think You train a model on Monday. Validation accuracy: 94.2%. Satisfied, you go home. 
Tuesday morning, a colleague re-runs the same training script, same data, same hyperparameters. Validation accuracy: 93.8%. "That's within normal variance," everyone agrees. You deploy Monday's model. Six months later, during an audit, someone asks: "Can you reproduce this model?" You run the original script. Validation accuracy: 94.1%. Close, but the weights are different. The predictions on edge cases are different. The model you're auditing isn't the model you deployed. This isn't a story about sloppy engineering. This is the reality of machine learning in 2026. And for safety-critical applications—autonomous vehicles, medical devices, aerospace systems—it's becoming a serious problem. ## The Myth of Random Seeds The standard advice is simple: "Set your random seed for reproducibility." ```python import torch torch.manual_seed(42) np.random.seed(42) random.seed(42) ``` This creates a comforting illusion. The same seed should produce the same sequence of "random" numbers, which should produce the same weight initialisation, the same batch ordering, the same dropout masks—and therefore the same trained model. Except it doesn't. Setting the seed is necessary but nowhere near sufficient. There are at least seven sources of non-determinism that seeds don't control, and most ML practitioners encounter them without realising it. ## Source 1: Floating-Point Ordering Consider a simple sum: `a + b + c + d` In exact arithmetic, the order doesn't matter. In floating-point arithmetic, it absolutely does: ``` (a + b) + (c + d) ≠ ((a + c) + b) + d ``` The differences are tiny—perhaps the 15th decimal place. But neural networks perform billions of these operations during training. Tiny errors accumulate. By the end of training, weights can differ meaningfully. Modern GPUs make this worse. To maximise throughput, they process operations in parallel and accumulate results in non-deterministic order. The same operation on the same GPU can produce different results on different runs—not because of randomness, but because of scheduling variations at the microsecond level. ## Source 2: GPU Non-Determinism NVIDIA's cuDNN library—the backbone of deep learning on GPUs—uses algorithms that are non-deterministic by default. Operations like convolution and pooling have multiple valid implementation strategies. cuDNN selects between them based on runtime profiling, cache state, and GPU occupancy. You can force deterministic mode: ```python torch.backends.cudnn.deterministic = True torch.backends.cudnn.benchmark = False ``` But this comes with a 10-30% performance penalty. Most teams disable it for training speed, accepting non-reproducibility as the cost. ## Source 3: Data Loading Parallelism Modern training pipelines load data in parallel: ```python DataLoader(dataset, num_workers=8, shuffle=True) ``` With multiple workers, the order in which batches arrive depends on operating system scheduling, disk I/O timing, and CPU load. The shuffle happens correctly, but the precise sequence of batches varies between runs. "But I set shuffle=True with a seed!" Yes, and the shuffle itself is reproducible. But which worker finishes first? That's determined by factors outside your control. ## Source 4: Batch Normalisation Statistics Batch normalisation layers compute running statistics during training: ``` running_mean = momentum * running_mean + (1 - momentum) * batch_mean ``` These statistics depend on the exact sequence of batches seen. If batch ordering varies (Source 3), running statistics vary. 
At inference time, the model behaves differently because its normalisation parameters are different. ## Source 5: Dropout and Stochastic Layers Dropout masks are generated from the random state. If anything perturbs that state—a different batch order, a parallel data loading race, even a print statement that triggers formatting code using random numbers—the dropout masks change. The same applies to any stochastic layer: variational layers, noise injection, stochastic depth. Each is a potential divergence point. ## Source 6: Hardware Differences Train on an A100 GPU. Deploy on a T4. Even with deterministic flags enabled, different GPU architectures have different floating-point implementations. Results will differ. Train on GPU. Run inference on CPU. The differences can be even larger—different instruction sets, different SIMD widths, different handling of denormals. "But the differences are tiny!" Yes. And "tiny" can mean a different classification on edge cases. In medical imaging, an edge case might be a tumour. ## Source 7: Library Versions PyTorch 2.0 changed the default random number generator algorithm. Models trained with PyTorch 1.x cannot be exactly reproduced with PyTorch 2.x—even with identical seeds. The "same" seed produces different numbers. cuDNN updates routinely change which algorithms are selected. CUDA versions affect floating-point behaviour. Even NumPy has changed its random number generation across versions. The model you trained 18 months ago, with pinned library versions? Good luck recreating that environment exactly. ## Why This Matters Now For research and commercial ML, non-reproducibility is an inconvenience. You run experiments multiple times and report averages. You accept some variance in production model quality. For safety-critical applications, non-reproducibility is a fundamental obstacle. **Certification Requirements** DO-178C (aerospace software) requires traceability from requirements to code to test evidence. If the same training process produces different models, which model are you certifying? Can you prove that the deployed model is the one you tested? IEC 62304 (medical device software) requires documented software development processes. If training is non-deterministic, how do you document what the process produced? How do you respond when a regulator asks to reproduce a result? ISO 26262 (automotive functional safety) requires systematic approaches to avoid unreasonable risk. How do you argue that your ML component is safe if you cannot reproduce its development? **The EU AI Act** The EU AI Act, coming into force progressively through 2025-2027, mandates technical documentation demonstrating how AI systems were developed. High-risk systems—including those in medical devices, vehicles, and critical infrastructure—require particularly rigorous documentation. Article 11 requires documentation of "the design specifications of the system, in particular the general logic of the AI system and of the algorithms." Article 12 requires "automatic recording of events" enabling tracing of the system's operation. Non-deterministic training makes both requirements difficult to satisfy. How do you document the logic of a model whose weights depend on GPU scheduling? How do you trace operation when you cannot reproduce the artefact? **Audit and Liability** When an autonomous vehicle causes an accident, lawyers will ask: "Show us the exact model that made this decision, and prove how it was trained." 
When a diagnostic AI misses a cancer, regulators will ask: "Reproduce the training process and show us that the deployed model matches your validation results." If you cannot reproduce training exactly, you cannot answer these questions definitively. You're left arguing that the model was "probably similar" to what you tested. That's not a strong legal position. ## The Gap The machine learning community has made extraordinary progress on model architectures, training techniques, and deployment infrastructure. Reproducibility has received comparatively little attention. This creates a gap: ML capabilities are racing ahead, while ML auditability remains stuck. The models get more powerful; the ability to certify them does not keep pace. For non-safety applications, this gap is acceptable. For safety-critical applications, it's becoming a blocker. Companies are building increasingly capable AI systems that they cannot certify for deployment in regulated environments. The question is not whether deterministic ML is desirable. It obviously is. The question is whether it's achievable. ## The Path Forward Every source of non-determinism listed above has a solution. The solutions are not trivial, but they exist. Floating-point ordering can be controlled through careful algorithm design—fixed accumulation orders, Kahan summation, or integer-based computation for critical paths. GPU non-determinism can be eliminated by avoiding non-deterministic operations or implementing deterministic alternatives. Data loading parallelism can be made deterministic through careful synchronisation and pre-computed batch orderings. Batch normalisation can use deterministic running statistics or be replaced with alternatives like layer normalisation. Dropout can use pre-computed masks derived from cryptographic hash functions rather than stateful random number generators. Hardware differences can be minimised through fixed-point arithmetic or careful floating-point discipline. Library dependencies can be controlled through rigorous environment management and, ultimately, through purpose-built frameworks that don't share codepaths with non-deterministic implementations. None of this is easy. All of it is possible. The result is ML training where: - Same data + same seed = identical model - Every weight is traceable to its initialisation and training history - Training can be replayed for audit or debugging - Certification evidence is reproducible This is what deterministic machine learning looks like. Not "mostly reproducible." Not "reproducible within tolerance." Bit-for-bit identical, on any hardware, years later. ## Implications If deterministic ML is achievable, several implications follow: **Certification becomes tractable.** You can train a model, validate it, and certify the specific artefact you validated. Reproduce it for audit. Demonstrate compliance with documentation requirements. **Debugging becomes possible.** When a model misbehaves, you can replay training to the exact step where problematic behaviour emerged. Step through the learning process. Identify causal factors. **Liability becomes manageable.** You can prove what model was deployed. Reproduce its development. Answer regulatory questions definitively rather than probabilistically. **Development becomes faster.** No more "run it five times to see if the result is real." Train once, evaluate once, trust the result. The safety-critical AI market is constrained not by model capability but by certification capability. 
Deterministic ML removes that constraint. ## The Question This article has described a problem. Solutions exist. The technology to build deterministic ML frameworks is available today. The question for anyone building AI for safety-critical applications: how much longer can you afford to deploy systems you cannot reproduce? --- *SpeyTech develops deterministic computing infrastructure for safety-critical systems. Our deterministic ML and RL frameworks achieve bit-perfect reproducibility across hardware platforms. Patents GB2521625.0 and GB2522369.4. For technical discussions, contact william@fstopify.com.* --- # Technical Articles — AI Architecture Production AI systems, MLOps patterns, and certifiable ML. ## The Certifiable-* Ecosystem: Eight Projects, One Deterministic ML Pipeline **URL**: https://speytech.com/ai-architecture/certifiable-ecosystem/ **Published**: January 19, 2026 23:00 **Topic**: AI Architecture From training data to deployed inference — bit-identical, auditable, certifiable There's a question that blocks AI adoption in safety-critical systems: *"Can you prove the model running on deployed hardware is exactly the same as what you tested?"* Not "similar". Not "statistically equivalent". The *same* — bit for bit, hash for hash, across different platforms, compilers, and architectures. With TensorFlow Lite, PyTorch, or ONNX Runtime, the answer is no. Floating-point arithmetic varies by platform. Hash table iteration order depends on memory allocation. Thread scheduling is inherently non-deterministic. For most applications, that doesn't matter. For aerospace, medical devices, and autonomous vehicles — where certification requires evidence, not assumptions — it's a fundamental barrier. The certifiable-* ecosystem removes that barrier. ## Eight Projects, One Pipeline The ecosystem consists of eight interconnected projects, each handling one stage of the ML pipeline: | Stage | Project | Purpose | Commitment | |-------|---------|---------|------------| | 0 | [certifiable-data](https://github.com/williamofai/certifiable-data) | Data pipeline | Merkle root of batches | | 1 | [certifiable-training](https://github.com/williamofai/certifiable-training) | Model training | Gradient chain hash | | 2 | [certifiable-quant](https://github.com/williamofai/certifiable-quant) | Quantization | Error certificate | | 3 | [certifiable-deploy](https://github.com/williamofai/certifiable-deploy) | Deployment packaging | Attestation tree | | 4 | [certifiable-inference](https://github.com/williamofai/certifiable-inference) | Forward pass | Predictions hash | | 5 | [certifiable-monitor](https://github.com/williamofai/certifiable-monitor) | Runtime monitoring | Ledger digest | | 6 | [certifiable-verify](https://github.com/williamofai/certifiable-verify) | Verification | Report hash | | — | [certifiable-harness](https://github.com/williamofai/certifiable-harness) | End-to-end orchestration | Golden reference | Every stage produces a cryptographic commitment. Every commitment chains to the next. Break any link, and verification fails. ## The Core Problem: Non-Determinism Traditional ML frameworks aren't designed for determinism. They're optimised for flexibility and performance: **Floating-point variance:** The same model produces different outputs on different CPUs due to FMA (fused multiply-add) availability, SIMD instruction selection, and compiler optimisations. **Memory allocation:** Python dictionaries, hash maps, and sets iterate in order determined by memory layout — which varies between runs. 
**Threading:** Parallel operations complete in unpredictable order. Reduce operations accumulate floating-point errors differently depending on execution timing.

**Dynamic allocation:** malloc() returns different addresses, affecting pointer-based data structures and timing.

For consumer applications, these differences are invisible. For certification, they're disqualifying.

## The Solution: Determinism by Design

The certifiable-* ecosystem takes a different approach:

### Fixed-Point Arithmetic (Q16.16)

Every calculation uses 32-bit fixed-point representation:

- 16 bits for the integer part
- 16 bits for the fractional part
- Range: -32768.0 to +32767.99998

No floating-point operations anywhere in the pipeline. Same inputs produce same outputs on any platform that implements integer arithmetic correctly — which is all of them.

```c
/* Q16.16 multiplication with overflow detection */
int32_t q16_mul(int32_t a, int32_t b, q16_fault_t *fault) {
    int64_t result = (int64_t)a * (int64_t)b;
    result >>= 16;
    if (result > Q16_MAX || result < Q16_MIN) {
        fault->overflow = 1;
        return (result > 0) ? Q16_MAX : Q16_MIN;
    }
    return (int32_t)result;
}
```

### Static Allocation

No malloc(). All buffers declared at compile time or allocated by the caller:

```c
/* Caller provides the buffer */
void ci_forward(const ci_model_t *model,
                const int32_t *input,
                int32_t *output,      /* Caller-allocated */
                int32_t *workspace,   /* Caller-allocated */
                ci_fault_t *fault);
```

No heap fragmentation. No allocation failures. Bounded memory usage provable at compile time.

### Deterministic Algorithms

- **Sorting:** Merge sort (stable, O(n log n) worst case)
- **Shuffling:** Feistel network with cycle-walking (deterministic given seed)
- **Hashing:** SHA-256 throughout
- **Reduction:** Ordered accumulation (no parallel reduce)

Every algorithm chosen for determinism first, performance second.

## Cryptographic Provenance

Each stage produces a 32-byte SHA-256 commitment that includes:

1. The stage's own output
2. The previous stage's commitment

This creates an unbroken chain from training data to deployed inference:

```
M_data   = MerkleRoot(batch_hashes)
H_train  = SHA256(M_data || gradient_chain)
H_cert   = SHA256(H_train || quantization_certificate)
R_attest = SHA256(H_cert || bundle_files)
H_pred   = SHA256(R_attest || predictions)
L_n      = SHA256(H_pred || ledger_entries)
H_report = SHA256(L_n || verification_results)
```

Modify any input, and every downstream commitment changes. The chain is tamper-evident by construction.

## The Harness: Proving Bit-Identity

certifiable-harness orchestrates all seven stages and compares results against a golden reference:

```bash
$ ./certifiable-harness --golden reference.golden --output result.json
═══════════════════════════════════════════════════════════════
Certifiable Harness v1.0.0
Platform: x86_64
═══════════════════════════════════════════════════════════════
[0] data      ✓ (OK, 4 µs)
[1] training  ✓ (OK, 3 µs)
[2] quant     ✓ (OK, 3 µs)
[3] deploy    ✓ (OK, 3 µs)
[4] inference ✓ (OK, 3 µs)
[5] monitor   ✓ (OK, 4 µs)
[6] verify    ✓ (OK, 8 µs)
Status: ALL STAGES PASSED ✓
Bit-identical: YES ✓
═══════════════════════════════════════════════════════════════
```

The golden reference is a 368-byte binary containing commitments from all seven stages. Run the harness on any platform — if the hashes match, you have mathematical proof of identical execution.
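As a rough, self-contained illustration of the chained-commitment idea, the sketch below uses 64-bit FNV-1a purely as a stand-in for SHA-256 and toy strings in place of the real stage artefacts; it is not the harness's format, only the shape of the check:

```c
/* Toy sketch of chained stage commitments and a golden-reference comparison.
 * FNV-1a stands in for SHA-256 so the example needs no crypto library; the
 * real harness hashes actual stage outputs into a 368-byte golden file. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bind the previous commitment and this stage's output into a new commitment. */
static uint64_t commit(uint64_t prev, const char *stage_output) {
    uint64_t h = 0xcbf29ce484222325ULL ^ prev;   /* FNV offset basis mixed with chain */
    for (const char *p = stage_output; *p; p++) {
        h ^= (uint8_t)*p;
        h *= 0x100000001b3ULL;                   /* FNV prime */
    }
    return h;
}

static void run_pipeline(uint64_t out[3]) {
    const char *stage_outputs[3] = { "data", "training", "inference" };
    uint64_t chain = 0;
    for (int i = 0; i < 3; i++) {
        chain = commit(chain, stage_outputs[i]);
        out[i] = chain;                          /* one commitment per stage */
    }
}

int main(void) {
    uint64_t golden[3], replay[3];
    run_pipeline(golden);   /* reference run: record the commitments once */
    run_pipeline(replay);   /* later run: recompute and compare, stage by stage */

    printf("Bit-identical: %s\n",
           memcmp(golden, replay, sizeof golden) == 0 ? "YES" : "NO");
    return 0;
}
```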
### Verified Cross-Platform The harness has been tested on: | Platform | OS | Compiler | Result | |----------|-----|----------|--------| | x86_64 | Linux (Ubuntu) | GCC 12.2.0 | ✓ Bit-identical | | x86_64 | macOS 11.7 | Apple Clang | ✓ Bit-identical | Different operating systems. Different compilers. Same hashes. ## What's Implemented | Project | Tests | Key Features | |---------|-------|--------------| | certifiable-data | 142 | CSV parsing, Merkle trees, deterministic shuffle | | certifiable-training | 10 suites | Gradient descent, weight updates, chain hashing | | certifiable-quant | 134 | FP32→Q16.16, error bounds, certificates | | certifiable-deploy | 147 | Bundle format, manifest, attestation | | certifiable-inference | 8 suites | Conv2D, pooling, dense layers, activations | | certifiable-monitor | 253 | Drift detection, ledger, policy enforcement | | certifiable-verify | 10 suites | Binding verification, report generation | | certifiable-harness | 4 suites | Orchestration, golden comparison | **Total: 700+ tests across 8 projects.** ## Documentation for Certification Each project includes formal documentation designed for regulatory review: - **MATH-001** — Mathematical specification (definitions, algorithms, proofs) - **STRUCT-001** — Data structure specification (types, layouts, invariants) - **SRS-xxx** — Software requirements (traceable, testable requirements) certifiable-harness alone has 81 traceable requirements across 4 SRS documents. ## Compliance Context The ecosystem is designed to support certification under: | Standard | Domain | Key Requirements | |----------|--------|------------------| | DO-178C Level A | Aerospace | MC/DC coverage, traceability, determinism | | IEC 62304 Class C | Medical devices | Risk management, verification, documentation | | ISO 26262 ASIL-D | Automotive | Fault tolerance, diagnostic coverage | | ISO 21448 (SOTIF) | Automotive AI | Behaviour verification, edge cases | | UL 4600 | Autonomous systems | Safety case, operational design domain | Deterministic execution simplifies verification. If the same inputs always produce the same outputs, testing becomes meaningful. If you can prove cross-platform identity, deployment becomes traceable. ## The Trade-Offs This approach has costs: **Performance:** Fixed-point is slower than optimised floating-point on modern GPUs. The ecosystem is designed for edge deployment where determinism matters more than throughput. **Precision:** Q16.16 has less dynamic range than FP32. For safety-critical applications, bounded precision with known error bounds is often preferable to unbounded precision with unknown variance. **Complexity:** Eight projects is more infrastructure than dropping in TensorFlow Lite. The question is whether that infrastructure is justified by the assurance it provides. **Ecosystem:** No pre-trained models, no model zoo, no community of contributors (yet). You're building from scratch. For consumer applications, these costs aren't justified. For systems where certification is mandatory and determinism is required, the alternative is often "don't use ML at all." ## Getting Started Clone any project and run the tests: ```bash git clone https://github.com/williamofai/certifiable-inference.git cd certifiable-inference mkdir build && cd build cmake .. make ctest --output-on-failure ``` For end-to-end verification: ```bash git clone https://github.com/williamofai/certifiable-harness.git cd certifiable-harness mkdir build && cd build cmake .. 
make # Generate golden reference ./certifiable-harness --generate-golden --output result.json # Verify (should show Bit-identical: YES) ./certifiable-harness --golden result.json.golden --output verify.json ``` ## What This Enables When a regulator asks "how do you know the deployed model is the same as what you tested?", the answer changes: **Before:** "We have a careful deployment process." **After:** "Here's a 368-byte golden reference. Run it on the deployed hardware. If the seven SHA-256 hashes match, the execution is mathematically identical. If they don't, I can tell you exactly which stage diverged." That's a different kind of answer. ## Repositories | Project | URL | |---------|-----| | certifiable-data | https://github.com/williamofai/certifiable-data | | certifiable-training | https://github.com/williamofai/certifiable-training | | certifiable-quant | https://github.com/williamofai/certifiable-quant | | certifiable-deploy | https://github.com/williamofai/certifiable-deploy | | certifiable-inference | https://github.com/williamofai/certifiable-inference | | certifiable-monitor | https://github.com/williamofai/certifiable-monitor | | certifiable-verify | https://github.com/williamofai/certifiable-verify | | certifiable-harness | https://github.com/williamofai/certifiable-harness | All projects are GPL-3.0 licensed. Commercial licensing available for organisations requiring proprietary deployment. --- The certifiable-* ecosystem represents one approach to deterministic ML. As with any architectural choice, suitability depends on system requirements, risk classification, and regulatory context. The goal isn't to replace general-purpose ML frameworks — it's to enable ML in domains where those frameworks can't currently go. **UK Patent Application GB2521625.0** — Murray Deterministic Computing Platform --- ## A Complete Deterministic ML Pipeline for Safety-Critical Systems **URL**: https://speytech.com/ai-architecture/deterministic-ml-pipeline/ **Published**: January 19, 2026 00:15 **Topic**: AI Architecture From training data to deployed inference — bit-identical, auditable, certifiable Production ML systems fail for boring reasons. Not exotic model architecture problems — infrastructure problems. Non-deterministic data loading. Floating-point variance across hardware. Training runs that can't be reproduced. Deployed models that don't match what was tested. For most applications, these are annoyances. For safety-critical systems — medical devices, autonomous vehicles, aerospace — they're disqualifying. You can't certify software that behaves differently each time it runs. This article describes a complete ML pipeline designed from first principles for deterministic execution and safety-critical certification. Five open-source projects, 771 tests, and a single invariant: **same input → same output, always**. ## The Problem with Standard ML Pipelines A typical ML pipeline has non-determinism at every stage: **Data loading**: Shuffle order depends on PRNG state. Multi-threaded loaders return batches in unpredictable order. Floating-point normalisation varies by platform. **Training**: Gradient reduction order affects floating-point accumulation. Dropout masks aren't reproducible. Optimizer state depends on execution order. **Quantization**: Error bounds unknown. Calibration statistics non-reproducible. No proof that quantized model preserves behaviour. **Deployment**: No binding between certified model and deployed artifact. No verification that weights weren't corrupted in transit. 
**Inference**: Different results on different hardware. Memory allocation timing affects cache state. No audit trail linking prediction to model version. Each of these is a solvable problem. But solving them requires designing the entire pipeline around determinism, not bolting it on afterward. ## The Certifiable Pipeline The solution is five projects that share common principles: ``` certifiable-data → certifiable-training → certifiable-quant → certifiable-deploy → certifiable-inference ``` Each project is pure C99, uses zero dynamic allocation, and produces bit-identical output across platforms. ### certifiable-data **Problem**: Standard data loaders introduce non-determinism through random shuffling, floating-point normalisation, and multi-threaded batch assembly. **Solution**: Data loading as a pure function. Given dataset D, seed s, epoch e, and batch index t, the output is deterministic: ``` B_t = Pipeline(D, s, e, t) ``` Key components: - **Feistel shuffling**: Cryptographic permutation that maps any index to its shuffled position in O(1) time. No sequential state, no execution-order dependence. - **Fixed-point normalisation**: Q16.16 arithmetic with precomputed inverse standard deviation. No floating-point division. - **Merkle provenance**: Every epoch produces a cryptographic commitment linking dataset hash, configuration, and shuffle seed. The same (dataset, seed, epoch, batch) produces the same samples, in the same order, with the same normalisation, forever. *Related reading: [The ML Non-Determinism Problem](/insights/ml-nondeterminism-problem/), [Bit-Perfect Reproducibility](/insights/bit-perfect-reproducibility/)* ### certifiable-training **Problem**: Training loops accumulate floating-point rounding errors non-deterministically. Gradient reduction order affects results. There's no audit trail proving what happened during training. **Solution**: Fixed-point gradients, deterministic reduction topology, and Merkle-chained step verification. Key components: - **Q8.24 gradients**: Higher precision for gradient accumulation, explicit rounding at each operation. - **Compensated summation**: Neumaier algorithm tracks rounding error, making large reductions accurate. - **Fixed reduction tree**: Binary tree topology eliminates execution-order dependence. - **Merkle chain**: Every training step produces a hash linking previous state, gradients, and new weights. Any step is independently verifiable. The training loop becomes a deterministic state machine. Given initial weights and training data, the final weights are reproducible bit-for-bit. *Related reading: [From Proofs to Code: Mathematical Transcription in C](/insights/mathematical-proofs-to-code/), [Cryptographic Execution Tracing](/insights/cryptographic-proof-execution/)* ### certifiable-quant **Problem**: Standard quantizers (TensorFlow Lite, ONNX) are black boxes. Error bounds unknown. No proof that quantized model preserves behaviour. Calibration non-reproducible. **Solution**: Provable FP32→Q16.16 quantization with formal error certificates. Key components: - **Theoretical analysis**: Compute error bounds *before* quantization. Overflow proofs, range propagation, Lipschitz constants. - **Empirical calibration**: Collect activation statistics with coverage metrics and degenerate detection. - **Verified conversion**: Quantize with explicit error tracking, verify against theoretical bounds. - **Cryptographic certificate**: Merkle root linking analysis, calibration, and verification digests. 
The output is a certificate proving the quantized model is within ε of the original, auditable forever. *Related reading: [Fixed-Point Neural Networks: The Math Behind Q16.16](/insights/fixed-point-neural-networks/), [Closure, Totality, and the Algebra of Safe Systems](/insights/closure-totality-algebra/)* ### certifiable-deploy **Problem**: How do you prove the deployed model matches what was certified? How do you verify weights weren't tampered with? How do you bind artifacts to specific hardware? **Solution**: Canonical bundle format with cryptographic attestation and target binding. Key components: - **CBF v1 format**: Deterministic container with no ambient metadata. Same content = same bytes = same hash. - **Merkle attestation**: 4-leaf tree binding manifest, weights, certificates, and inference artifacts. - **JCS manifest**: RFC 8785 canonical JSON for deterministic serialization. - **Target binding**: Lock bundles to specific platforms (arch-vendor-device-abi). - **Runtime loader**: Fail-closed state machine. Inference API unreachable without completing verification chain. The loader implements "Execution ⇒ Verification": no inference without cryptographic proof that the loaded model matches the certified model. *Related reading: [DO-178C Level A Certification](/insights/do178c-certification/), [The Real Cost of Dynamic Memory](/insights/dynamic-memory-safety-critical/)* ### certifiable-inference **Problem**: Different hardware produces different results. Memory allocation affects timing. No audit trail linking predictions to model versions. **Solution**: Integer-only forward pass with static allocation and deterministic execution paths. Key components: - **Fixed-point arithmetic**: Q16.16 for all weights and activations. No floating-point in the critical path. - **Static buffers**: All memory provided by caller. No malloc in inference loop. - **Bounded execution**: Fixed iteration counts, no data-dependent branching, WCET-analysable. The same input produces the same output on x86, ARM, and RISC-V. Bit-identical, every time. *Related reading: [WCET for Neural Network Inference](/ai-architecture/wcet-neural-network-inference/), [IEC 62304 Class C Requirements](/insights/iec-62304-class-c/)* ## The Three Theorems Every project in the certifiable family satisfies three properties: | Theorem | Statement | Implication | |---------|-----------|-------------| | **Bit Identity** | F_A(s) = F_B(s) for any DVM-compliant platforms A, B | Cross-platform reproducibility | | **Bounded Error** | Error saturates, doesn't accumulate | Predictable behaviour | | **Auditability** | Any operation verifiable in O(1) time | Scalable verification | These aren't aspirations — they're tested properties. 771 tests across 39 test suites verify these invariants hold. ## What This Enables ### Reproducible Research Training runs that produce identical results, always. No "worked on my machine" failures. No unexplained variance between runs. ### Certification Evidence DO-178C Level A requires complete requirements traceability. IEC 62304 Class C requires validated transformations. ISO 26262 ASIL-D requires provable behaviour. This pipeline provides the evidence these standards demand. ### Incident Investigation When something goes wrong, you can replay exact inputs through exact model versions and get exact outputs. No "we can't reproduce the failure" mysteries. 
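A minimal sketch of why replay works, assuming a pure, deterministic step function and a captured input log; the update rule here is a toy stand-in for real inference, not any project's API:

```c
/* Toy illustration: a pure step function plus a logged input sequence
 * reproduces the exact final state, bit for bit. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { int32_t acc; uint32_t steps; } model_state_t;

/* Deterministic: the next state depends only on (state, input). */
static model_state_t step(model_state_t s, int32_t input) {
    s.acc += input;      /* stand-in for a real inference/update rule */
    s.steps += 1;
    return s;
}

static model_state_t replay(model_state_t s, const int32_t *inputs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        s = step(s, inputs[i]);
    }
    return s;
}

int main(void) {
    const int32_t logged_inputs[] = { 3, -1, 4, 1, 5 };  /* captured in production */
    const model_state_t initial = { 0, 0 };

    model_state_t original = replay(initial, logged_inputs, 5);
    model_state_t rerun    = replay(initial, logged_inputs, 5);  /* months later */

    /* Same initial state + same inputs = identical final state. */
    assert(original.acc == rerun.acc && original.steps == rerun.steps);
    printf("replayed: acc=%d steps=%u\n", (int)original.acc,
           (unsigned)original.steps);
    return 0;
}
```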
### Audit Trails Cryptographic proof linking every prediction to: the deployed bundle, the quantization certificate, the training Merkle chain, and the data provenance. Years later, you can prove exactly how that prediction was made. ## Test Coverage The pipeline is backed by 771 tests across 39 test suites: | Project | Tests | Suites | |---------|-------|--------| | certifiable-data | 133 | 8 | | certifiable-training | 223 | 10 | | certifiable-quant | 150 | 7 | | certifiable-deploy | 201 | 7 | | certifiable-inference | 64 | 7 | | **Total** | **771** | **39** | These aren't smoke tests. They verify bit-identical output across platforms, correct handling of edge cases, and conformance to formal specifications. ## Getting Started Each project builds with CMake: ```bash git clone https://github.com/williamofai/certifiable-{name} cd certifiable-{name} mkdir build && cd build cmake .. make make test ``` Documentation includes mathematical foundations (CT-MATH-001), data structure specifications (CT-STRUCT-001), and Software Requirements Specifications with full traceability. ## When You Don't Need This This level of rigour has costs: more complex implementation, restricted arithmetic, verbose documentation. If you're building a recommendation system or a chatbot, standard frameworks are fine. You need this when: - Regulatory standards require reproducible behaviour (DO-178C, IEC 62304, ISO 26262) - Failures have safety implications - You need to prove what happened, not just log it - Cross-platform consistency matters ## Further Reading **Core Concepts:** - [Bit-Perfect Reproducibility: Why It Matters and How to Prove It](/insights/bit-perfect-reproducibility/) - [The ML Non-Determinism Problem](/insights/ml-nondeterminism-problem/) - [From Proofs to Code: Mathematical Transcription in C](/insights/mathematical-proofs-to-code/) **Safety-Critical Standards:** - [DO-178C Level A Certification](/insights/do178c-certification/) - [IEC 62304 Class C: What Medical Device Software Actually Requires](/insights/iec-62304-class-c/) - [ISO 26262 and ASIL-D: The Role of Determinism](/insights/iso-26262-asil-d-determinism/) **Infrastructure Patterns:** - [The Real Cost of Dynamic Memory in Safety-Critical Systems](/insights/dynamic-memory-safety-critical/) - [Cryptographic Execution Tracing and Evidentiary Integrity](/insights/cryptographic-proof-execution/) - [WCET for Neural Network Inference](/ai-architecture/wcet-neural-network-inference/) **Production ML:** - [The Floating-Point Trap](/ai-architecture/floating-point-danger/) - [TFLite and DO-178C: The Certification Gap](/ai-architecture/tflite-do178c-challenges/) - [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/) --- Context: This pipeline is designed for safety-critical applications with specific regulatory requirements. For most ML applications, standard frameworks provide better tooling, larger ecosystems, and faster development. Evaluate the trade-offs in your specific context. The goal isn't complexity for its own sake. It's providing the evidence that safety-critical certification demands. When a regulator asks "prove this model behaves deterministically," you need more than good intentions — you need 771 tests and cryptographic audit trails. All projects are open source under GPL-3.0, with commercial licensing available for proprietary safety-critical systems. 
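To show the shape of such an audit trail without implying the projects' actual record format, here is a deliberately simplified hash-chain sketch; FNV-1a stands in for a real cryptographic hash, and the record fields are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy hash chain linking prediction records. FNV-1a is used only so the
 * example is self-contained; a real audit trail would use a cryptographic
 * hash and a specified record layout. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed) {
    const uint8_t *p = (const uint8_t *)data;
    uint64_t h = seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;   /* FNV-1a 64-bit prime */
    }
    return h;
}

typedef struct {
    uint64_t prev_hash;      /* links to the previous record */
    uint64_t bundle_hash;    /* deployed bundle digest */
    int32_t  input[4];       /* quantized input, Q16.16 */
    int32_t  output;         /* quantized prediction */
    int32_t  model_version;  /* model identifier */
} audit_record_t;

int main(void) {
    uint64_t chain = 14695981039346656037ULL;  /* FNV offset basis as genesis */
    audit_record_t rec = { chain, 0x1234, { 65536, 0, -65536, 32768 }, 98304, 1 };

    /* Each record's hash covers the previous hash, so altering or
     * reordering any record breaks every later link. */
    chain = fnv1a(&rec, sizeof rec, chain);
    printf("chain head: %016llx\n", (unsigned long long)chain);
    return 0;
}
```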
[View Projects on GitHub](https://github.com/williamofai) · [Technical Documentation](https://speytech.com/open-source/)

---

## WCET Analysis for Neural Network Inference

**URL**: https://speytech.com/ai-architecture/wcet-neural-network-inference/
**Published**: January 15, 2026 22:31
**Topic**: AI Architecture

How to prove worst-case execution time for convolution, matrix multiply, and pooling operations

Safety-critical systems must respond within guaranteed time bounds. A pacemaker that usually delivers a signal on time is not acceptable. An aircraft flight control system that occasionally misses a deadline is not certifiable. The system must always meet its timing requirements, including under worst-case conditions.

Worst-Case Execution Time (WCET) analysis provides the mathematical foundation for these guarantees. By analysing code structure, loop bounds, and hardware behaviour, WCET analysis determines the maximum time a computation can take. This upper bound feeds into schedulability analysis, ensuring that all tasks complete before their deadlines.

Neural network inference presents specific challenges for WCET analysis. The computations are regular and predictable in structure, which is favourable. But they are also computationally intensive, making tight bounds essential for practical systems. Loose bounds waste capacity; tight bounds require careful analysis.

This article explains WCET analysis techniques applied to neural network operations: convolution, matrix multiplication, activation functions, and pooling. The goal is practical guidance for engineers who need to fill in the timing section of a certification package.

## WCET Analysis Fundamentals

WCET analysis determines an upper bound on execution time without running the code on all possible inputs. Two complementary approaches exist:

**Static analysis** examines the code structure and hardware model to compute bounds mathematically. It requires no test execution but needs accurate models of both software and hardware.

**Measurement-based analysis** runs the code on representative inputs and measures execution time. It provides concrete data but cannot guarantee that the worst case was observed.

Hybrid approaches combine both: static analysis for structure, measurement for hardware timing. For certification, the analysis must be justified as sound, meaning the computed bound is genuinely an upper bound on all possible executions.

### Loop Bounds

The foundation of static WCET analysis is determining how many times each loop executes. For neural network operations, loop bounds derive from tensor dimensions:

```c
// Loop bounds are explicit in dimensions
for (int oh = 0; oh < out_h; oh++) {
    for (int ow = 0; ow < out_w; ow++) {
        // Body executes exactly out_h * out_w times
    }
}
```

### Data-Dependent Timing

Loop bounds alone are not enough. Execution time must also be independent of the input values flowing through those loops:

```c
// Data-dependent timing - avoid
if (value > threshold) {
    expensive_operation();
}

// Data-independent timing - preferred
result = (value > threshold) ? a : b;  // Both paths similar cost
```

Neural network operations with [fixed-point arithmetic](/insights/fixed-point-neural-networks/) are naturally data-independent. Integer multiply takes the same time regardless of operand values on most processors. This property simplifies WCET analysis considerably.

**Design Property: Timing Predictability.** Code with data-independent timing and static loop bounds enables tight WCET analysis. The execution time is determined by code structure, not input values.

## Analysing Convolution

2D convolution is typically the most expensive operation in CNN inference. Its nested loop structure makes WCET analysis straightforward in principle but requires attention to memory access patterns.
### Loop Structure

Standard convolution has four nested loops:

```c
void fx_conv2d(const fixed_t* input, const fixed_t* kernel, fixed_t* output,
               int in_h, int in_w, int k_h, int k_w) {
    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;

    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int64_t acc = 0;
            for (int kh = 0; kh < k_h; kh++) {
                for (int kw = 0; kw < k_w; kw++) {
                    int ih = oh + kh;
                    int iw = ow + kw;
                    acc += (int64_t)input[ih * in_w + iw] * kernel[kh * k_w + kw];
                }
            }
            output[oh * out_w + ow] = (fixed_t)(acc >> 16);
        }
    }
}
```

### Iteration Count

Total inner loop iterations:

```
N_iter = out_h × out_w × k_h × k_w
       = (in_h - k_h + 1) × (in_w - k_w + 1) × k_h × k_w
```

For a 16×16 input with 3×3 kernel:

```
N_iter = 14 × 14 × 3 × 3 = 1,764 iterations
```

### Per-Iteration Cost

The inner loop body contains:

- 2 additions (ih, iw calculation)
- 2 multiplications (index calculations)
- 2 memory loads (input, kernel)
- 1 widening multiply (64-bit)
- 1 addition (accumulator)

On a representative 32-bit microcontroller (ARM Cortex-M4):

| Operation | Cycles (approx) |
|-----------|-----------------|
| Integer add | 1 |
| Integer multiply | 1 |
| Memory load (cache hit) | 1-2 |
| Memory load (cache miss) | 10-50+ |
| 64-bit multiply | 3-5 |
| 64-bit add | 1-2 |

Assuming cache hits, inner loop: ~10-15 cycles per iteration.

### Total Bound

```
WCET_conv = N_iter × cycles_per_iter + loop_overhead
          = 1,764 × 15 + overhead
          ≈ 26,500 cycles
```

At 100 MHz: ~265 μs

This is a rough bound. Precise analysis requires:

- Exact instruction sequence from compiled code
- Processor pipeline model
- Cache behaviour analysis
- Memory access timing

### Cache Considerations

Cache behaviour significantly affects WCET for convolution. The access pattern determines hit rates:

**Kernel accesses**: The kernel is small (9 elements for 3×3) and accessed repeatedly. After initial loads, kernel accesses hit cache consistently.

**Input accesses**: Input access pattern has spatial locality within rows but jumps between rows. For small inputs that fit in cache, this works well. For large inputs, cache misses occur at row boundaries.

**Output accesses**: Output is written sequentially, which is cache-friendly.

Conservative WCET analysis assumes cache misses. Tighter analysis models the access pattern to prove hit rates.

## Analysing Matrix Multiplication

Dense layers use matrix multiplication. The analysis follows similar principles.

### Loop Structure

```c
void fx_matmul(const fixed_t* A, const fixed_t* B, fixed_t* C,
               int M, int K, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;
            for (int k = 0; k < K; k++) {
                acc += (int64_t)A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = (fixed_t)(acc >> 16);
        }
    }
}
```

### Iteration Count

```
N_iter = M × N × K
```

For a layer with 256 inputs and 128 outputs (M=1, K=256, N=128):

```
N_iter = 1 × 128 × 256 = 32,768 iterations
```

### Memory Access Pattern

Matrix B is accessed with stride N, which can cause cache thrashing for large N. The standard mitigation is loop tiling, but this complicates WCET analysis by introducing additional loop bounds. For safety-critical systems, simpler untiled implementations with conservative cache assumptions may be preferred over optimised implementations with complex timing behaviour.

## Analysing Activation Functions

Activation functions apply element-wise operations. Their WCET is proportional to tensor size.

### ReLU

```c
void fx_relu(fixed_t* data, int size) {
    for (int i = 0; i < size; i++) {
        data[i] = (data[i] > 0) ? data[i] : 0;
    }
}
```

The conditional assignment has the same cost for either outcome, so WCET is simply proportional to size.

### Sigmoid

Sigmoid is implemented with a clamped lookup table and linear interpolation:

```c
fixed_t fx_sigmoid(fixed_t x) {
    // Clamp to table range
    if (x < TABLE_MIN) return SIGMOID_MIN;
    if (x > TABLE_MAX) return SIGMOID_MAX;

    // Table lookup with interpolation
    int index = (x - TABLE_MIN) >> SHIFT;
    fixed_t frac = (x - TABLE_MIN) & MASK;
    fixed_t y0 = sigmoid_table[index];
    fixed_t y1 = sigmoid_table[index + 1];
    return y0 + fx_mul(frac, y1 - y0);
}
```

The branches create multiple paths.
WCET is the maximum across all paths:

```
WCET_sigmoid = max(clamp_low_path, clamp_high_path, interpolation_path)
```

The interpolation path is typically longest, including table lookups and multiplication.

## Analysing Pooling

Max pooling compares elements within windows. The comparison count is fixed by window size.

### 2×2 Max Pooling

```c
void fx_maxpool_2x2(const fixed_t* input, fixed_t* output,
                    int in_h, int in_w) {
    int out_h = in_h / 2;
    int out_w = in_w / 2;

    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int ih = oh * 2;
            int iw = ow * 2;

            fixed_t max_val = input[ih * in_w + iw];
            fixed_t v1 = input[ih * in_w + iw + 1];
            if (v1 > max_val) max_val = v1;
            fixed_t v2 = input[(ih + 1) * in_w + iw];
            if (v2 > max_val) max_val = v2;
            fixed_t v3 = input[(ih + 1) * in_w + iw + 1];
            if (v3 > max_val) max_val = v3;

            output[oh * out_w + ow] = max_val;
        }
    }
}
```

### Iteration Count

```
N_iter = out_h × out_w = (in_h / 2) × (in_w / 2)
```

### Per-Iteration Cost

Each iteration:

- 4 memory loads
- 3 comparisons
- 3 conditional moves (or branches)
- 1 memory store
- Index calculations

The comparisons have data-dependent branches, but both outcomes have identical cost (conditional move or store to max_val). WCET is the same regardless of which element is maximum.

## Composing Layer WCET

Network WCET is the sum of layer WCETs:

```
WCET_network = WCET_conv1 + WCET_relu1 + WCET_pool1
             + WCET_conv2 + WCET_relu2 + WCET_pool2
             + WCET_dense + WCET_softmax
```

This assumes sequential execution. Pipelined or parallel implementations require more sophisticated analysis.

### Example: Small CNN

Consider a simple CNN for image classification:

| Layer | Dimensions | Iterations | Est. Cycles |
|-------|------------|------------|-------------|
| Conv 3×3 | 28×28→26×26 | 60,840 | 912,600 |
| ReLU | 26×26 | 676 | 2,028 |
| MaxPool 2×2 | 26×26→13×13 | 169 | 2,535 |
| Conv 3×3 | 13×13→11×11 | 10,890 | 163,350 |
| ReLU | 11×11 | 121 | 363 |
| MaxPool 2×2 | 11×11→5×5 | 25 | 375 |
| Dense | 25→10 | 250 | 3,750 |
| **Total** | | | **~1,085,000** |

At 100 MHz: ~10.85 ms per inference.

This is a rough estimate. Actual WCET requires precise analysis of compiled code on the target processor.

## Measurement-Based Verification

Static analysis provides bounds; measurement provides validation. The two should be consistent.

### Test Methodology

```c
void measure_wcet(void) {
    fixed_t input[INPUT_SIZE];
    fixed_t output[OUTPUT_SIZE];
    uint32_t max_cycles = 0;

    for (int trial = 0; trial < NUM_TRIALS; trial++) {
        generate_test_input(input, trial);

        uint32_t start = read_cycle_counter();
        run_inference(input, output);
        uint32_t cycles = read_cycle_counter() - start;

        if (cycles > max_cycles) {
            max_cycles = cycles;
        }
    }

    printf("Measured max: %u cycles\n", max_cycles);
    printf("Static bound: %u cycles\n", STATIC_WCET_BOUND);

    // Measured should be less than static bound
    assert(max_cycles <= STATIC_WCET_BOUND);
}
```

### Compiler Effects

The bound applies to the compiled machine code, not the C source. Disassembling the inference object file makes the analysed instruction sequence explicit:

```bash
arm-none-eabi-objdump -d inference.o > inference.asm
```

Analysis tools work on object code or binary, not source.

### Hardware Variability

Even deterministic code has timing variation from hardware:

- Cache state at function entry
- Pipeline state
- Memory refresh cycles (DRAM)
- Interrupt latency (if interrupts enabled)

WCET analysis must account for these. Conservative analysis assumes worst-case cache state. Precise analysis models cache contents.

### Tooling

Commercial WCET analysis tools include:

- AbsInt aiT
- Rapita RapiTime
- Bound-T

These tools combine static analysis with hardware models to compute bounds. For certification, tool qualification may be required. For simpler systems, manual analysis with spreadsheets and measurement validation may suffice.

## Documentation for Certification

DO-178C and similar standards require documented evidence. WCET documentation should include:

**Analysis method**: Static, measurement-based, or hybrid. Justification for soundness.
**Assumptions**: Processor model, clock speed, cache configuration, interrupt policy. **Loop bounds**: How each loop bound was determined. Proof that bounds are correct. **Computed bounds**: WCET for each function and the complete inference path. **Measurement data**: Test methodology, number of trials, observed distribution. **Margin**: Difference between computed WCET and deadline. Justification that margin is sufficient. This documentation becomes part of the certification package, providing evidence that timing requirements are met. ## Implementation Reference The [certifiable-inference](https://github.com/williamofai/certifiable-inference) project includes timing verification: - Loop bounds documented for all operations - Static allocation eliminates malloc timing variance - Measurement benchmarks for validation - Demonstrated <5% jitter at 95th percentile The implementation is designed for analysability, with simple loop structures and [data-independent timing](/insights/bit-perfect-reproducibility/). ## Conclusion WCET analysis for neural network inference follows established techniques: determine loop bounds, compute per-iteration costs, and sum across the network. The regularity of neural network operations makes them amenable to analysis, provided the implementation avoids data-dependent timing and dynamic allocation. The key requirements are: - **Static loop bounds** derived from tensor dimensions - **Data-independent timing** from fixed-point arithmetic and branchless code - **Simple memory patterns** that enable cache analysis - **Measurement validation** confirming that static bounds hold For safety-critical systems, WCET analysis is not optional. The timing section of a certification package must demonstrate that inference completes within its deadline under all conditions. The techniques in this article provide a foundation for that demonstration. As with any certification effort, the analysis must be appropriate for the specific system, processor, and assurance level. Simple systems may use manual analysis; complex systems may require commercial tools. The goal is justified confidence that timing requirements are met. --- *For an implementation designed for WCET analysability, see [certifiable-inference](https://github.com/williamofai/certifiable-inference) or try the [live simulator](https://inference.speytech.com/).* --- ## Why TensorFlow Lite Faces Challenges in DO-178C Certification **URL**: https://speytech.com/ai-architecture/tflite-do178c-challenges/ **Published**: January 15, 2026 20:45 **Topic**: AI Architecture Understanding the architectural properties that complicate aerospace certification for mobile inference frameworks Note: This article examines general architectural patterns in mobile inference frameworks and how they interact with aerospace certification requirements. It is not a comprehensive certification analysis, nor does it represent the position of any certification authority. Actual certification decisions depend on the specific system, intended use, and assessor interpretation. TensorFlow Lite is used as a representative example; similar considerations may apply to other inference frameworks. Aerospace software certification under DO-178C imposes rigorous requirements on how software is designed, verified, and documented. These requirements evolved over decades of aviation safety experience and are mandatory for software whose failure could affect flight safety. 
Mobile inference frameworks like TensorFlow Lite were designed with different priorities: flexibility, performance across diverse hardware, and ease of use for researchers and app developers. These are legitimate engineering goals that have made deep learning accessible to millions of applications. However, the architectural choices that enable flexibility and broad hardware support can create challenges when the same frameworks are considered for aerospace applications. Understanding these challenges helps teams make informed decisions about whether to adapt existing frameworks, build custom inference engines, or pursue hybrid approaches. This article examines specific architectural properties common in mobile inference frameworks and explains why they can complicate DO-178C certification efforts, particularly at the higher Design Assurance Levels (DAL A and DAL B). ## DO-178C Requirements Overview DO-178C establishes objectives for software development based on the severity of potential failures. Design Assurance Level A applies when software failure could cause or contribute to catastrophic failure conditions. Level B applies to hazardous conditions. Levels C, D, and E apply to progressively less severe scenarios. At DAL A, the standard requires: **Requirements traceability.** Every software requirement must trace to system requirements, and every line of code must trace to a software requirement. The purpose of each code element must be documented and justified. **Verification coverage.** Testing must achieve structural coverage (MC/DC for DAL A), meaning every decision outcome and condition combination must be exercised. Dead code is prohibited. **Deterministic behaviour.** While DO-178C does not explicitly use the word "deterministic," its objectives effectively require predictable, reproducible behaviour. Evidence must demonstrate that software performs its intended functions under all operational conditions. **Configuration control.** Every artefact affecting the software must be under configuration control with documented history. These objectives are achievable but require significant discipline in design and documentation. ## Architectural Properties of Mobile Inference Frameworks Mobile inference frameworks share common architectural patterns that optimise for their primary use cases. The following analysis uses publicly documented behaviour; specific implementation details may vary across versions. ### Dynamic Memory Allocation Mobile frameworks typically allocate memory dynamically during model loading and inference: ```cpp // Typical pattern in inference frameworks TfLiteTensor* tensor = interpreter->tensor(index); // Tensor data may be allocated lazily or resized ``` Dynamic allocation creates verification challenges at high DALs: **Variable timing.** Allocation time depends on heap state, creating unpredictable worst-case execution time (WCET). DAL A systems typically require bounded timing analysis. **Fragmentation risk.** Long-running systems may experience heap fragmentation, causing allocation failures that are difficult to reproduce during testing. **Coverage complexity.** Allocation failure paths must be tested. With many allocation points, achieving structural coverage of all failure paths can be challenging. The CAST-21 position paper from certification authorities addresses dynamic memory, noting that demonstrating compliance requires specific measures beyond typical development practices. 
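For contrast, a common pattern in certifiable inference code is a single statically sized arena with tensor offsets fixed at build time. The sketch below is hypothetical and not TensorFlow Lite's API; it shows why the allocation-related objections above disappear when there is no heap at all:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One statically sized arena; every tensor lives at an offset computed
 * offline from the model graph. No heap, no fragmentation, no
 * allocation-failure paths to cover. */
#define ARENA_BYTES (64 * 1024)

static uint8_t g_arena[ARENA_BYTES];

typedef struct {
    size_t offset;   /* byte offset into g_arena, fixed at build time */
    size_t size;     /* tensor size in bytes */
} tensor_slot_t;

enum { SLOT_INPUT, SLOT_ACTIVATIONS, SLOT_OUTPUT, SLOT_COUNT };

/* Illustrative values; a real tool would emit these from the model. */
static const tensor_slot_t g_slots[SLOT_COUNT] = {
    [SLOT_INPUT]       = { 0,      4 * 1024 },
    [SLOT_ACTIVATIONS] = { 4096,  16 * 1024 },
    [SLOT_OUTPUT]      = { 20480,  1 * 1024 },
};

static void *slot_ptr(int slot) {
    return &g_arena[g_slots[slot].offset];
}

void load_input(const uint8_t *data, size_t len) {
    /* Worst-case memory use is the arena size, known before deployment. */
    if (len <= g_slots[SLOT_INPUT].size) {
        memcpy(slot_ptr(SLOT_INPUT), data, len);
    }
}
```

Worst-case memory use equals the arena size, timing does not depend on heap state, and there are no allocation-failure paths to exercise for structural coverage.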
### Hardware Abstraction Layers Frameworks abstract hardware differences to support multiple platforms: ```cpp // Framework selects implementation based on hardware if (cpu_supports_neon()) { neon_conv2d(input, kernel, output); } else if (cpu_supports_sse()) { sse_conv2d(input, kernel, output); } else { reference_conv2d(input, kernel, output); } ``` This flexibility creates certification considerations: **Multiple code paths.** Each hardware backend is effectively a different implementation. Complete verification requires testing each path the deployed system might execute. **Platform dependence.** Behaviour may vary across platforms in subtle ways. Demonstrating equivalence requires evidence that all backends produce acceptable results. **Conditional complexity.** Runtime hardware detection adds branches that must be covered and justified. For certification, the target hardware configuration is typically fixed, so unused backends could potentially be excluded. However, this requires modifying the framework build or demonstrating that excluded code is truly unreachable. ### Floating-Point Computation Neural network inference frameworks predominantly use floating-point arithmetic: ```cpp // Standard floating-point convolution for (int i = 0; i < output_size; i++) { float sum = 0.0f; for (int j = 0; j < kernel_size; j++) { sum += input[i + j] * kernel[j]; // FP multiply-accumulate } output[i] = sum; } ``` Floating-point creates reproducibility challenges: **Platform variance.** Different processors implement floating-point with different intermediate precision, FMA availability, and rounding behaviour. The same computation may produce slightly different results on different hardware. **Compiler effects.** Optimisation flags can change operation ordering, affecting results due to floating-point non-associativity. **Verification complexity.** If results vary across platforms, test cases cannot use exact expected values. Tolerance-based testing requires justification that the tolerance is acceptable for the application. For safety-critical applications, some teams choose [fixed-point arithmetic](/insights/fixed-point-neural-networks/) to eliminate floating-point variance. This requires model quantisation and validation of accuracy loss. ### Third-Party Dependencies Frameworks depend on external libraries for math operations, threading, and platform services: ``` TensorFlow Lite dependencies (representative): ├── Eigen (linear algebra) ├── FlatBuffers (serialization) ├── gemmlowp (quantized matrix multiply) ├── ruy (matrix multiply) ├── XNNPACK (neural network operators) └── platform-specific libraries ``` Dependencies affect certification in several ways: **Verification scope.** All code that executes in the certified system must be verified to an appropriate level. Third-party libraries require the same evidence as first-party code. **Change management.** Updates to dependencies require re-verification. Frameworks with frequent releases and many dependencies create configuration management challenges. **Traceability.** Requirements for third-party code may not exist or may not meet aerospace documentation standards. Some certification approaches use "previously developed software" (PDS) provisions for stable, well-characterised libraries. This requires evidence of the library's service history and suitability for the application. 
### Code Size and Complexity Mobile inference frameworks optimise for feature coverage rather than minimal footprint: ``` Component sizes (representative, version-dependent): - TensorFlow Lite core: ~1MB compiled - Operator kernels: ~2-5MB depending on included ops - Dependencies: variable, potentially several MB ``` Large codebases affect certification economics: **Verification cost.** Testing and documentation effort scales with code size. Structural coverage of millions of lines of code is expensive. **Dead code.** Features unused by the target application may remain in the binary unless explicitly excluded. Dead code is prohibited at DAL A. **Review burden.** Code reviews for certification require understanding each component's purpose and behaviour. Custom inference engines for specific models can achieve much smaller footprints by including only required operations. This reduces verification scope but requires custom development. ## Quantifying the Challenge To illustrate the scale, consider a hypothetical certification effort for a small neural network using a mobile inference framework: | Aspect | Typical Framework | Custom Engine | |--------|-------------------|---------------| | Binary size | 2-5 MB | 20-100 KB | | Source lines (approx) | 100K-500K | 2K-10K | | External dependencies | 5-15 | 0-2 | | Hardware backends | 3-10 | 1 | | Memory allocation | Dynamic | Static | | Structural coverage scope | All executed code | All code | These numbers are illustrative; actual figures depend on specific configurations. The key observation is that verification effort correlates with code complexity and variability. ## Approaches Teams Have Taken Organisations pursuing aerospace AI have adopted various strategies: ### Framework Modification Some teams fork existing frameworks and modify them for certification: - Remove unused operators and backends - Replace dynamic allocation with static buffers - Add traceability documentation - Isolate and justify third-party dependencies This approach preserves compatibility with trained models but requires substantial engineering effort and ongoing maintenance as upstream frameworks evolve. ### Custom Implementation Other teams build inference engines from scratch: - Include only required operations - Design for verification from the start - Eliminate unnecessary variability - Achieve small, auditable codebases This approach reduces verification scope but requires implementing and validating each neural network operation. ### Hybrid Approaches Some teams use frameworks during development and custom engines for deployment: - Train models using standard frameworks (TensorFlow, PyTorch) - Export weights and architecture - Implement inference in a certifiable custom engine - Validate equivalence between framework and custom outputs This preserves the benefits of mature training infrastructure while deploying a minimal, certifiable inference engine. ### Formal Methods Emerging approaches apply formal verification to neural network implementations: - Prove properties of fixed-point arithmetic implementations - Verify absence of runtime errors through static analysis - Demonstrate bounded execution time through formal timing analysis These techniques are maturing but show promise for reducing testing burden. ## What Certification Authorities Consider Certification authorities (DERs in the US, OSD in Europe) evaluate each case individually. Factors they may consider include: **Safety analysis.** How do inference failures affect system safety? 
What mitigations exist? Lower-criticality applications may accept more framework complexity. **Operational constraints.** Is the system continuous or bounded in operation? Short mission times may reduce concerns about fragmentation and memory exhaustion. **Verification evidence.** What testing has been performed? How comprehensive is the coverage? Strong verification evidence can sometimes offset architectural complexity. **Service history.** Has this framework been used successfully in similar applications? Service history can support qualification arguments. **Development assurance.** Was the framework developed to aerospace standards? Most general-purpose frameworks were not, which affects what evidence is available. The outcome depends on the specific system, its safety role, and the persuasiveness of the certification argument. ## Implications for Project Planning Teams considering neural network inference for aerospace applications should: **Assess early.** Certification implications should inform architecture decisions at project start, not be discovered during verification. **Engage authorities.** Early engagement with certification authorities (through certification plans and issue papers) can clarify expectations before major investment. **Consider alternatives.** Custom inference engines may have higher initial development cost but lower verification cost. The trade-off depends on model complexity and reuse expectations. **Budget realistically.** Certifying complex software at high DALs is expensive regardless of approach. Underestimating verification effort is a common project failure mode. **Plan for maintenance.** Aerospace software lifecycles span decades. Architectures that simplify updates and re-verification reduce lifetime cost. ## Conclusion Mobile inference frameworks like TensorFlow Lite represent excellent engineering for their intended applications. Their architectural choices—dynamic allocation, hardware abstraction, floating-point computation, and extensive dependencies—enable broad compatibility and ease of use. These same properties can complicate certification under DO-178C, particularly at higher Design Assurance Levels. The challenges are not insurmountable, but they require significant effort to address: modifying frameworks, creating extensive verification evidence, or building custom implementations. Teams pursuing aerospace AI should understand these challenges early and choose approaches that align with their certification strategy. For some applications, adapting existing frameworks may be cost-effective. For others, purpose-built inference engines may reduce overall certification effort. The aerospace industry continues to develop best practices for certifiable AI. As experience accumulates and tools mature, clearer patterns will emerge. In the meantime, understanding the interaction between framework architecture and certification requirements helps teams make informed decisions. As with any certification effort, success depends on early planning, realistic assessment, and close coordination with certification authorities. The challenges are significant but not unprecedented—aerospace has successfully certified complex software before, and neural network inference will eventually follow established paths to certification. 
--- *For an example of an inference architecture designed with certification in mind, see [certifiable-inference](https://github.com/williamofai/certifiable-inference), which demonstrates fixed-point arithmetic, static allocation, and minimal dependencies. A [live simulator](https://inference.speytech.com/) shows the approach in action.* --- ## Why Floating Point Is Dangerous: The Case for Deterministic AI in C **URL**: https://speytech.com/ai-architecture/floating-point-danger/ **Published**: January 14, 2026 20:00 **Topic**: AI Architecture When 'mostly reproducible' isn't good enough for systems that matter ## The Problem Nobody Talks About Consider a medical device company developing an ML model to detect cardiac abnormalities. The model passes all tests—95% accuracy, low latency, no memory leaks. It runs perfectly in validation. During safety review, they discover something troubling: given identical ECG data, the model sometimes classifies the signal as normal and sometimes as abnormal. The same input. Different outputs. They can't reproduce specific predictions reliably. The model isn't wrong. It's **non-deterministic**. This scenario—or variations of it—represents a common challenge in safety-critical ML development. An FDA submission gets delayed while teams debug reproducibility issues. An automotive supplier discovers their perception model behaves differently across hardware platforms. A financial system produces inconsistent risk scores for the same customer data. The symptoms differ, but the root cause is the same: standard ML infrastructure was built for research, not for systems where reproducibility is a regulatory requirement. Modern ML infrastructure was built for research, not for systems where "mostly reproducible" isn't good enough. ## The Three Sources of Non-Determinism ML models fail to reproduce for three reasons. Two are well-known. The third is the one that actually causes production failures. ### 1. Floating Point Arithmetic (The Obvious One) Everyone knows floating point math is imprecise. `0.1 + 0.2 ≠ 0.3` in binary. Rounding errors accumulate. Operations aren't associative: `(a + b) + c ≠ a + (b + c)`. What's less obvious: **compiler optimization changes the results**. 
```c
// Without optimization
float sum = 0.0f;
for (int i = 0; i < n; i++) {
    sum += values[i];
}

// With -O2 and vectorization, the compiler is free to reorder and
// regroup these additions. Different order = different rounding =
// a different sum from the same data.
```

Reordering changes which rounding errors accumulate, so an optimised build can produce a different result than a debug build from identical inputs. The deterministic alternative is fixed-point arithmetic: integer operations with an implied binary scale, exact and fully specified:

```c
// Q16.16 fixed-point: 16 integer bits, 16 fractional bits
typedef int32_t fixed_t;

fixed_t fx_mul(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * (int64_t)b) >> 16);
}

// Example: 2.5 * 3.75
fixed_t a = (2 << 16) + (1 << 15);   // 2.5  -> 163840
fixed_t b = (3 << 16) + (3 << 14);   // 3.75 -> 245760
fixed_t c = fx_mul(a, b);            // 9.375 -> 614400, bit-identical everywhere
```

### 2. Iteration Order

Hash maps, parallel reductions, and filesystem listings visit elements in an order that can change between runs. If feature processing depends on that order, the output changes with it. The fix is to impose a total order before iterating:

```c
typedef struct {
    char name[32];
    fixed_t value;
} feature_t;

int compare_features(const void* a, const void* b) {
    return strcmp(((feature_t*)a)->name, ((feature_t*)b)->name);
}

void process_features(feature_t* features, size_t count) {
    qsort(features, count, sizeof(feature_t), compare_features);

    // Now iteration order is deterministic
    for (size_t i = 0; i < count; i++) {
        process_feature(&features[i]);
    }
}
```

### 3. Memory Allocation

Dynamic allocation returns different addresses on every run, and allocation timing depends on heap state. Anything that depends on pointer values, allocation order, or allocation latency stops being reproducible. The fix is a statically sized pool with an explicit, repeatable allocation sequence:

```c
#define POOL_SIZE (1 << 20)

typedef struct {
    uint8_t pool[POOL_SIZE];
    size_t used;
} memory_pool_t;

void* pool_alloc(memory_pool_t* pool, size_t size) {
    if (pool->used + size > POOL_SIZE) {
        return NULL;  // Explicit failure
    }
    void* ptr = pool->pool + pool->used;
    pool->used += size;
    return ptr;
}

// Allocation addresses are now deterministic
// Same allocation sequence = same addresses = same behavior
```

## What This Looks Like in Practice

A deterministic neural network inference engine:

```c
// Model state: fixed size, pre-allocated
typedef struct {
    fixed_t weights[MAX_WEIGHTS];
    fixed_t activations[MAX_ACTIVATIONS];
    memory_pool_t pool;
    uint32_t layer_count;
} model_t;

// Initialize with bounded resources
model_t* model_init(const uint8_t* model_data, size_t model_size) {
    model_t* m = calloc(1, sizeof(model_t));
    if (!m) return NULL;

    // Load weights into fixed-point format
    load_weights(m, model_data, model_size);

    // Pre-allocate all activation memory
    pool_init(&m->pool, sizeof(m->activations));

    return m;
}

// Inference: deterministic, bounded
void model_predict(model_t* m, const fixed_t* input, fixed_t* output) {
    // Reset pool to initial state (deterministic allocation)
    pool_reset(&m->pool);

    // Forward pass: exact fixed-point ops
    for (uint32_t i = 0; i < m->layer_count; i++) {
        layer_forward(m, i, input);
    }

    // Copy result
    memcpy(output, m->activations + output_offset, output_size);
}
```

**Properties guaranteed:**

- Same input → same output (bit-for-bit identical)
- Bounded memory (no dynamic allocation)
- Bounded time (no data-dependent loops with unbounded iteration)
- No undefined behavior (every operation is specified)

This is certifiable. This is reproducible. This is what safety-critical systems require.

## The Performance Question

"But fixed-point is slower than floating point!"

Sometimes yes. Often no. Depends on the target.

**On CPUs with FPUs:** Floating point might be faster (hardware acceleration).

**On microcontrollers without FPUs:** Fixed-point is dramatically faster (no software floating point emulation).

**On embedded DSPs:** Fixed-point is native (that's what they're designed for).

**For cache-sensitive workloads:** Fixed-point uses less memory (int32_t vs float = same size, but fixed-point models often quantize to int16_t or int8_t).

More importantly: **determinism enables optimizations floating point can't do.** When behavior is guaranteed, you can:

- Pre-compute lookup tables
- Prove loop bounds for unrolling
- Guarantee cache behavior
- Enable aggressive optimizations that would be unsafe with floating point

The real performance win isn't raw FLOPS. It's **predictability**. When you know exactly how long every operation takes, you can schedule with confidence. No worst-case margin. No jitter. Just deterministic timing.

## When Floating Point Is Fine

Most ML doesn't need this. Research models, recommendation systems, content classification—non-determinism is acceptable. The cost of determinism exceeds the benefit.
**Use floating point when:** - Approximate results are acceptable - You're not certifying for safety - Development speed matters more than reproducibility - You need the full dynamic range - You're not debugging production issues **Consider determinism when:** - Debugging is expensive (hard to reproduce issues) - Compliance requires proof of correctness - Systems are long-lived (must behave identically for years) - Liability is significant (people can be harmed) - Trust matters (financial, medical, legal decisions) ## The Path Forward Writing deterministic ML infrastructure in C isn't exotic. It's how embedded systems have done signal processing for decades. Digital filters, FFTs, control loops—all deterministic, all fixed-point, all certifiable. The techniques exist. The standards exist. What's missing is applying them to modern ML. **The opportunity:** Build inference engines that guarantee reproducibility. Create tools that enable certification. Solve the problems that prevent ML from being deployed in systems that matter. **The challenge:** Convenience vs. correctness. Floating point is easier. Determinism requires discipline. **The reality:** For systems where failure has consequences, "easier" isn't enough. Autonomous vehicles, medical devices, industrial control systems, financial infrastructure—these aren't research projects. They're production systems where "mostly reproducible" is unacceptable. The industry needs deterministic ML infrastructure built by people who understand both the mathematics and the constraints. People who've shipped safety-critical systems. People who know the difference between "it works" and "it provably works." If you're building ML for systems that matter, floating point isn't just dangerous—it's a liability you can't afford. ## Related Reading For more on production ML infrastructure: - [Production AI Systems: What 30 Years of UNIX Taught Me](/ai-architecture/production-ai-unix-principles/) - Infrastructure principles that enable reproducibility - [Debugging Model Behavior in Production](/ai-architecture/debugging-model-behavior-production/) - How non-determinism makes debugging impossible - [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/) - Why traditional monitoring misses non-deterministic failures --- *Building deterministic ML infrastructure? I'm working on open-source tools to solve exactly these problems. [Get in touch](/contact/) if you're facing these challenges in production systems.* --- ## Debugging Model Behavior in Production **URL**: https://speytech.com/ai-architecture/debugging-model-behavior-production/ **Published**: January 13, 2026 21:40 **Topic**: AI Architecture When the model works in staging but fails in prod, here's how to find out why ## The Symptoms The model worked perfectly in development. Accuracy was 94%. Latency was 50ms. Integration tests passed. Staging looked good. Then you deploy to production. Within an hour, customers report incorrect predictions. You check the dashboard - accuracy is 67%. Some predictions are returning null. P99 latency is 8 seconds. You roll back. Everything returns to normal. You try deploying again the next day. Same problems. The model itself hasn't changed. The code hasn't changed. But something is different between staging and production. This is the most frustrating type of ML failure: behavior that appears only in production and can't be reproduced in development. ## The Investigation Process Debugging production model behavior follows a pattern. 
Work through these steps systematically.

### Step 1: Capture the Failing Input

You need the exact input that caused the problem. Not a similar input. Not a representative sample. The specific input that failed.

```python
# Add input capture on prediction failures
def predict_with_capture(model, input_data, request_id):
    try:
        result = model.predict(input_data)
        return result
    except Exception as e:
        # Capture failing input
        failing_input = {
            'request_id': request_id,
            'input_data': sanitize_for_storage(input_data),
            'error': str(e),
            'timestamp': time.time(),
            'model_version': model.version
        }

        # Store for later analysis
        store_failing_input(failing_input)

        log.error("prediction_failed",
            request_id=request_id,
            error=str(e),
            input_hash=hash_input(input_data)
        )
        raise

def sanitize_for_storage(input_data):
    """Remove PII but keep structure and statistics"""
    if isinstance(input_data, dict):
        return {
            k: sanitize_value(v)
            for k, v in input_data.items()
        }
    return input_data

def sanitize_value(value):
    """Replace sensitive values while preserving type/range"""
    if isinstance(value, str):
        return f"<string len={len(value)}>"
    elif isinstance(value, (int, float)):
        return f"<number ~{round(value, -1)}>"  # Approximate
    return value
```

**Why this works:** You can't debug what you can't reproduce. Capturing the failing input lets you reproduce the failure locally.

**Common mistake:** Only logging aggregated statistics. "5% of predictions failed" tells you nothing about which inputs failed or why.

### Step 2: Check Input Distribution Shift

Production data often differs from training data in subtle ways.

```python
# Compare production input distributions to training
def analyze_input_distribution(production_inputs, training_inputs):
    """Compare key statistics between production and training"""
    stats = {}

    for field in production_inputs[0].keys():
        prod_values = [inp[field] for inp in production_inputs]
        train_values = [inp[field] for inp in training_inputs]

        if isinstance(prod_values[0], (int, float)):
            stats[field] = {
                'prod_mean': mean(prod_values),
                'train_mean': mean(train_values),
                'prod_std': std(prod_values),
                'train_std': std(train_values),
                'prod_min': min(prod_values),
                'train_min': min(train_values),
                'prod_max': max(prod_values),
                'train_max': max(train_values),
            }

            # Flag significant differences
            mean_diff = abs(stats[field]['prod_mean'] - stats[field]['train_mean'])
            if mean_diff > stats[field]['train_std']:
                stats[field]['warning'] = 'mean_shift'

        elif isinstance(prod_values[0], str):
            prod_unique = set(prod_values)
            train_unique = set(train_values)

            stats[field] = {
                'prod_unique_count': len(prod_unique),
                'train_unique_count': len(train_unique),
                'new_values': prod_unique - train_unique,
                'missing_values': train_unique - prod_unique
            }

            if stats[field]['new_values']:
                stats[field]['warning'] = 'unseen_categories'

    return stats

# Run this analysis periodically
prod_inputs = load_recent_production_inputs(hours=24)
train_inputs = load_training_sample()

distribution_stats = analyze_input_distribution(prod_inputs, train_inputs)

# Alert on significant shifts
for field, stats in distribution_stats.items():
    if 'warning' in stats:
        log.warning("distribution_shift",
            field=field,
            warning=stats['warning'],
            details=stats
        )
```

**What to look for:**

- Numeric features outside training range
- Categorical features with unseen values
- Skewed distributions (prod mean far from training mean)
- Missing or null values not present in training

**Real example:** A fraud detection model failed because production transactions included a new `payment_method` value ("buy_now_pay_later") that wasn't in training data.
Model had no learned behavior for this value. ### Step 3: Reproduce in Isolation Take the failing input and run it through the model in a controlled environment. ```python # Reproduce failure locally def reproduce_failure(failing_input_record): """Attempt to reproduce a production failure locally""" # Load exact model version from production model = load_model( version=failing_input_record['model_version'] ) # Reconstruct input input_data = failing_input_record['input_data'] # Reproduce prediction try: result = model.predict(input_data) print(f"Local prediction succeeded: {result}") print("This suggests environment difference, not model issue") return { 'reproduced': False, 'local_result': result } except Exception as e: print(f"Local prediction failed: {e}") print("Error reproduced - this is a model/input issue") # Analyze the failure analyze_prediction_failure(model, input_data, e) return { 'reproduced': True, 'error': str(e) } def analyze_prediction_failure(model, input_data, error): """Deep dive into why prediction failed""" # Check input validity print("\n=== Input Validation ===") for key, value in input_data.items(): print(f"{key}: type={type(value)}, value={value}") # Check for NaN/Inf if isinstance(value, float): if math.isnan(value): print(f" WARNING: {key} is NaN") if math.isinf(value): print(f" WARNING: {key} is Inf") # Check feature preprocessing print("\n=== Feature Preprocessing ===") try: features = model.preprocess(input_data) print(f"Preprocessing succeeded: {features}") except Exception as e: print(f"Preprocessing failed: {e}") print("Issue is in feature engineering, not model inference") return # Check model inference print("\n=== Model Inference ===") try: output = model.forward(features) print(f"Model inference succeeded: {output}") except Exception as e: print(f"Model inference failed: {e}") print("Issue is in model execution") ``` **If it reproduces locally:** Problem is in the model or input. Proceed to Step 4. **If it doesn't reproduce locally:** Problem is environmental (dependencies, resources, data sources). Proceed to Step 5. ### Step 4: Isolate the Layer Models have layers: preprocessing, feature engineering, inference, postprocessing. Find which layer fails. 
```python # Test each layer independently def isolate_failing_layer(model, input_data): """Determine which model layer causes failure""" layers = [] # Layer 1: Input validation try: validated = model.validate_input(input_data) layers.append(('validation', 'pass', validated)) except Exception as e: layers.append(('validation', 'fail', str(e))) return layers # Can't proceed # Layer 2: Feature extraction try: features = model.extract_features(validated) layers.append(('feature_extraction', 'pass', features)) except Exception as e: layers.append(('feature_extraction', 'fail', str(e))) return layers # Layer 3: Preprocessing (scaling, encoding) try: preprocessed = model.preprocess(features) layers.append(('preprocessing', 'pass', preprocessed)) except Exception as e: layers.append(('preprocessing', 'fail', str(e))) return layers # Layer 4: Model inference try: raw_output = model.forward(preprocessed) layers.append(('inference', 'pass', raw_output)) except Exception as e: layers.append(('inference', 'fail', str(e))) return layers # Layer 5: Postprocessing try: final_output = model.postprocess(raw_output) layers.append(('postprocessing', 'pass', final_output)) except Exception as e: layers.append(('postprocessing', 'fail', str(e))) return layers return layers # Run isolation analysis layers = isolate_failing_layer(model, failing_input) for layer_name, status, result in layers: print(f"{layer_name}: {status}") if status == 'fail': print(f" Error: {result}") print(f" Issue is in {layer_name} layer") break ``` **Common failure points:** **Preprocessing:** Scaling/normalization with unexpected input ranges ```python # Fails if input outside training range scaled = (value - mean) / std # std might be 0 for constant features ``` **Feature extraction:** Missing or malformed fields ```python # Fails if field doesn't exist age = input_data['user_age'] # KeyError if missing ``` **Inference:** NaN or Inf propagation through network ```python # NaN inputs create NaN outputs output = model(features) # Silently propagates NaN ``` ### Step 5: Check Environmental Differences If the failure doesn't reproduce locally, the environment differs between staging and production. 
```python # Compare staging vs production environments def compare_environments(): """Capture environment details for comparison""" import sys import platform env_info = { # Python environment 'python_version': sys.version, 'platform': platform.platform(), # Package versions 'tensorflow_version': tf.__version__, 'numpy_version': np.__version__, 'pandas_version': pd.__version__, # System resources 'cpu_count': os.cpu_count(), 'available_memory_gb': psutil.virtual_memory().available / 1e9, # Model file hash (verify model is identical) 'model_file_hash': hash_file('model.pt'), # Configuration 'model_config': model.get_config(), # Data sources 'feature_db_host': os.getenv('FEATURE_DB_HOST'), 'feature_db_version': get_db_version(), } return env_info # Capture in both environments staging_env = compare_environments() production_env = compare_environments() # Find differences differences = [] for key in staging_env: if staging_env[key] != production_env[key]: differences.append({ 'key': key, 'staging': staging_env[key], 'production': production_env[key] }) if differences: print("Environment differences found:") for diff in differences: print(f" {diff['key']}:") print(f" Staging: {diff['staging']}") print(f" Production: {diff['production']}") ``` **Common environmental causes:** **Version mismatches:** TensorFlow 2.10 vs 2.12 - subtle behavior changes **Resource constraints:** Staging has 8GB RAM, production has 4GB - OOM failures **Data sources:** Staging uses cached data, production queries live database - latency/content differences **Concurrency:** Staging is single-threaded, production is multi-threaded - race conditions ### Step 6: Enable Prediction Logging For issues that are intermittent or rare, enable detailed prediction logging to capture context when failures occur. 
```python # Detailed prediction logging class PredictionLogger: def __init__(self, sample_rate=0.1): self.sample_rate = sample_rate def should_log(self, is_error=False): """Always log errors, sample successes""" if is_error: return True return random.random() < self.sample_rate def log_prediction(self, input_data, output, error=None, metadata=None): """Log prediction with full context""" if not self.should_log(is_error=error is not None): return log_entry = { 'timestamp': time.time(), 'input_hash': hash_input(input_data), 'input_stats': compute_input_stats(input_data), 'output': output if error is None else None, 'error': str(error) if error else None, 'metadata': metadata or {}, 'environment': { 'model_version': metadata.get('model_version'), 'host': socket.gethostname(), 'memory_usage_mb': get_memory_usage(), } } # Store for analysis prediction_log.write(log_entry) # Use in serving logger = PredictionLogger(sample_rate=0.1) def predict(input_data, request_id): metadata = {'request_id': request_id, 'model_version': model.version} try: output = model.predict(input_data) logger.log_prediction(input_data, output, metadata=metadata) return output except Exception as e: logger.log_prediction(input_data, None, error=e, metadata=metadata) raise ``` **Analysis queries:** ```python # Find patterns in failures failures = prediction_log.query( "SELECT * FROM predictions WHERE error IS NOT NULL" ) # Group by error type error_counts = failures.groupby('error').size() print("Most common errors:") print(error_counts.sort_values(ascending=False)) # Find input patterns that fail for error_type in error_counts.index[:5]: error_inputs = failures[failures['error'] == error_type] print(f"\n=== {error_type} ===") print("Input characteristics:") print(error_inputs['input_stats'].describe()) ``` ## Common Production-Only Failures ### Scale-Dependent Failures Problems that only appear at production traffic levels. **Memory leaks:** Small per-request leak becomes catastrophic at 1000 req/sec ```python # Leaked memory accumulates cache = {} # Never cleared cache[request_id] = result # Grows forever ``` **Resource exhaustion:** Connection pools, file handles, GPU memory ```python # Runs out after 1000 requests db_connection = create_connection() # Never closed ``` **Mitigation:** Load testing at production scale, resource monitoring, connection pooling ### Data-Dependent Failures Problems triggered by specific data patterns that appear rarely. **Adversarial inputs:** Unusual combinations not in training data ```python # Model never saw age=150 prediction = model.predict({'age': 150, 'income': 50000}) ``` **Edge cases:** Extreme values, empty lists, null handling ```python # Division by zero on empty list avg_purchase = sum(purchases) / len(purchases) # len=0 in prod ``` **Mitigation:** Input validation, defensive coding, edge case testing ### Timing-Dependent Failures Problems that depend on request timing or state. **Race conditions:** Multiple requests modifying shared state ```python # Not thread-safe if cache.get(key) is None: cache[key] = expensive_computation() # Race condition return cache[key] ``` **Stale caches:** Features cached too long, out of sync with model ```python # Feature cached 1 hour ago, model updated 30 minutes ago features = feature_cache.get(user_id) # Stale ``` **Mitigation:** Thread-safe code, cache invalidation, versioning ## The Debugging Toolkit Essential tools for production model debugging: **1. Request replay:** Capture and replay production requests locally **2. 
Diff tool:** Compare staging vs production environments **3. Input profiler:** Analyze input distributions over time **4. Layer inspector:** Step through model layers with real inputs **5. Prediction logs:** Comprehensive logging with sampling (see [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/)) These tools, combined with the systematic process above, make most production failures debuggable. ## The Unsexy Truth Most production model failures aren't exotic ML problems. They're boring software engineering problems: - Missing input validation - Unhandled edge cases - Version mismatches - Resource constraints - Race conditions The ML model itself is usually fine. The infrastructure around it is broken. Fix the infrastructure using standard debugging techniques. The systematic process above works because it treats model serving as a software system, not as magic. When your model works in staging but fails in production, it's telling you something about the difference between those environments. Listen to it. Capture the failing inputs. Reproduce the failure. Isolate the cause. Fix the infrastructure. Then add tests to prevent regression. The same failure pattern shouldn't surprise you twice. ## Related Reading For more on production ML infrastructure: - [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/) - What to log to make debugging possible - [Production AI Systems: What 30 Years of UNIX Taught Me](/ai-architecture/production-ai-unix-principles/) - Infrastructure principles including failure handling - [Model Serving Architecture Patterns](/ai-architecture/ai-model-serving-patterns/) - Architecture decisions that affect debuggability --- ## When You Don't Need a Feature Store **URL**: https://speytech.com/ai-architecture/when-you-dont-need-feature-store/ **Published**: January 13, 2026 21:15 **Topic**: AI Architecture Most teams solve a problem they don't have yet ## The Pattern A team builds their first production ML model. It works. Then someone asks: "Should we use a feature store?" The question implies the answer. Feature stores are standard MLOps infrastructure. Every mature ML organization has one. The vendors say so. The conference talks recommend them. Not having a feature store feels like technical debt. So the team spends three months evaluating Feast, Tecton, and Databricks Feature Store. Another two months integrating the chosen solution. Another month debugging why features aren't matching between training and serving. Six months later, they're serving predictions from a feature store that recomputes features on every request - exactly what they were doing before, but with more complexity and latency. This pattern repeats constantly. Feature stores solve real problems. But most teams don't have those problems yet. ## What Feature Stores Actually Solve Feature stores solve three specific problems: **Problem 1: Training-Serving Skew** When training uses different feature computation logic than serving. The model trains on `sum(purchases_last_30_days)` but serves with `sum(purchases_last_month)` - different results, model breaks. **Problem 2: Feature Recomputation** When multiple models need the same features. Computing `user_lifetime_value` independently for each model wastes resources. **Problem 3: Point-in-Time Correctness** When training needs historical feature values. For a prediction made on 2024-06-15, what was `user_tier` on that date? Naive joins use current values, introducing data leakage. 
These are real problems. If you have them, feature stores help. But you might not have them yet. ## When You Don't Need a Feature Store ### You Have One Model If you have a single model, training-serving skew is easy to avoid without infrastructure: ```python # features.py - single source of truth def compute_features(user_data, transaction_data): """Used by both training and serving""" return { 'total_purchases': len(transaction_data), 'avg_purchase_value': mean([t.amount for t in transaction_data]), 'days_since_last_purchase': (today() - max([t.date for t in transaction_data])).days, # ... more features } # training.py features = compute_features(user_data, transactions) model.train(features, labels) # serving.py features = compute_features(user_data, transactions) prediction = model.predict(features) ``` This works. It's simple. It's maintainable. Skew is impossible - same code path for both. **Why this works:** With one model, feature logic fits in one file. No coordination needed. No shared infrastructure required. **When this breaks:** When you have 10 models and each reimplements `compute_features()` slightly differently. Now you have skew risk and maintenance burden. ### Your Features Are Request-Scoped If features only use data in the request, there's nothing to store: ```python # Request contains everything needed @app.post("/predict") def predict(request: PredictRequest): features = { 'transaction_amount': request.amount, 'merchant_category': request.merchant_category, 'is_international': request.country != 'US', 'hour_of_day': datetime.now().hour, } return model.predict(features) ``` **Why this works:** No historical data needed. No precomputation needed. Feature store would add latency without benefit. **When this breaks:** When you need `user_average_transaction_amount` or `merchant_fraud_rate` - data not in the request. Now you need storage. ### You Can Tolerate Batch Predictions If predictions can be computed overnight and cached, feature stores are overkill: ```python # Nightly batch job def compute_all_predictions(): users = load_all_users() for user in users: features = compute_features(user) prediction = model.predict(features) cache.set(f"prediction:{user.id}", prediction) # Serving just reads cache @app.get("/prediction/{user_id}") def get_prediction(user_id: str): return cache.get(f"prediction:{user_id}") ``` **Why this works:** Features computed once per day. Predictions cached. Serving is just cache lookup. No online feature computation needed. **When this breaks:** When predictions need to be real-time based on latest data. Now you need online features. ### Your Training Data Is Small If your training dataset fits in memory, point-in-time correctness is a SQL query: ```python # Training with point-in-time correctness training_data = db.query(""" SELECT u.user_id, u.created_at, COUNT(t.id) as num_transactions, AVG(t.amount) as avg_transaction FROM events e JOIN users u ON e.user_id = u.user_id LEFT JOIN transactions t ON t.user_id = u.user_id AND t.timestamp < e.timestamp -- Point-in-time correctness WHERE e.label IS NOT NULL GROUP BY u.user_id, u.created_at """) ``` **Why this works:** Database handles point-in-time joins. No feature store materialization needed. Results are fast enough for typical model training. **When this breaks:** When you have billions of training examples and complex feature joins. Now the SQL query takes hours. Feature store precomputation becomes necessary. 
## What to Use Instead If you don't need a feature store, use simpler alternatives: ### Alternative 1: Shared Feature Functions ```python # features/user_features.py def compute_user_features(user_id: str, as_of: datetime = None): """Compute user features for training or serving Args: user_id: User identifier as_of: Timestamp for point-in-time correctness (training) If None, uses current time (serving) """ as_of = as_of or datetime.now() transactions = db.query( "SELECT * FROM transactions WHERE user_id = ? AND timestamp < ?", user_id, as_of ) return { 'num_transactions': len(transactions), 'total_spent': sum(t.amount for t in transactions), 'avg_transaction': mean([t.amount for t in transactions]), 'days_since_last': (as_of - max([t.timestamp for t in transactions])).days } # Training uses as_of for point-in-time correctness train_features = [ compute_user_features(ex.user_id, as_of=ex.timestamp) for ex in training_examples ] # Serving uses current time serve_features = compute_user_features(request.user_id) ``` **Advantages:** - Training-serving skew impossible (same code) - Point-in-time correctness handled - No new infrastructure - Easy to debug (just Python) **Disadvantages:** - Repeated computation (no caching across models) - Slow for many models or large-scale training ### Alternative 2: Cached Aggregations ```python # Precompute expensive features, cache results class FeatureCache: def __init__(self, cache_ttl_seconds=300): self.cache = {} self.ttl = cache_ttl_seconds def get_user_features(self, user_id: str): cache_key = f"user_features:{user_id}" # Check cache if cache_key in self.cache: cached_value, timestamp = self.cache[cache_key] if time.time() - timestamp < self.ttl: return cached_value # Compute and cache features = self._compute_user_features(user_id) self.cache[cache_key] = (features, time.time()) return features def _compute_user_features(self, user_id): # Expensive computation here return compute_features(user_id) # Use in serving feature_cache = FeatureCache(cache_ttl_seconds=300) @app.post("/predict") def predict(request: PredictRequest): features = feature_cache.get_user_features(request.user_id) return model.predict(features) ``` **Advantages:** - Fast serving (cache hits avoid computation) - No infrastructure beyond Redis/Memcached - TTL controls freshness - Works for multiple models **Disadvantages:** - Cache invalidation complexity - No point-in-time correctness for training - Need to handle cache misses ### Alternative 3: Materialized Views ```python # Database-native feature materialization db.execute(""" CREATE MATERIALIZED VIEW user_features AS SELECT user_id, COUNT(*) as num_transactions, SUM(amount) as total_spent, AVG(amount) as avg_transaction, MAX(timestamp) as last_transaction_date FROM transactions GROUP BY user_id """) # Refresh periodically (e.g., hourly) db.execute("REFRESH MATERIALIZED VIEW user_features") # Training queries the view train_features = db.query(""" SELECT u.*, f.* FROM training_examples u JOIN user_features f ON u.user_id = f.user_id """) # Serving queries the view serve_features = db.query( "SELECT * FROM user_features WHERE user_id = ?", user_id ) ``` **Advantages:** - Database-native (no new systems) - Fast reads (precomputed) - SQL-based (familiar tools) - Works for moderate scale **Disadvantages:** - Refresh lag (data staleness) - Less flexible than code - Doesn't scale to billions of features ## When You Actually Need a Feature Store You need a feature store when: **1. 
Multiple teams, many models** When 5 teams are building 20 models and all need `user_lifetime_value`. Reimplementing it 20 times creates skew risk and maintenance burden. **2. Real-time features at scale** When you need sub-100ms serving with features computed from terabytes of data. Materialized views and caches don't scale to this. **3. Complex point-in-time correctness** When training requires accurate historical feature values across dozens of feature types with different update frequencies. **4. Feature reuse is proven valuable** When you measure that 80% of features are shared across models. Not when you hope they might be shared someday. **5. Feature computation is expensive** When computing features costs more than storing them. For example, complex aggregations over streaming data. At this point, feature store infrastructure pays for its complexity. Before this point, it's premature optimization. ## The Migration Path If you start simple and later need a feature store, migration is straightforward: **Phase 1: Shared functions (current state)** ```python def compute_features(user_id): # Compute on demand return features ``` **Phase 2: Add caching** ```python def compute_features(user_id): cached = cache.get(f"features:{user_id}") if cached: return cached features = _compute(user_id) cache.set(f"features:{user_id}", features, ttl=300) return features ``` **Phase 3: Separate computation from serving** ```python # Background job precomputes features def precompute_features(): for user_id in active_users(): features = compute_features(user_id) feature_store.write(user_id, features) # Serving reads precomputed features def get_features(user_id): return feature_store.read(user_id) ``` **Phase 4: Add feature store** ```python # Now using Feast/Tecton/etc features = feature_store.get_online_features( entity_rows=[{"user_id": user_id}], features=["user_lifetime_value", "transaction_count"] ) ``` Each phase works independently. You only move to the next phase when current phase's limitations become painful. ## The Unsexy Truth Feature stores solve real problems. But those problems appear at scale, not at the start. Most teams building their first few models don't have: - Dozens of models competing for feature computation resources - Terabytes of feature data requiring specialized storage - Complex point-in-time correctness requirements across teams What they have: - One or two models - Features that fit in a database - Team small enough to coordinate in Slack For these teams, a feature store is complexity without benefit. Shared functions and basic caching solve the same problems with less infrastructure. Build the simple thing first. Add complexity when you have evidence you need it. You'll know when that time comes - your team will be spending more time working around the limitations of simple approaches than they would spend adopting a feature store. Until then, skip it. 
## Related Reading

For more on infrastructure decisions in production ML:

- [Production AI Systems: What 30 Years of UNIX Taught Me](/ai-architecture/production-ai-unix-principles/) - Principles for avoiding premature complexity
- [Model Serving Architecture Patterns](/ai-architecture/ai-model-serving-patterns/) - When to choose simple vs complex serving architectures

---

## Model Serving Architecture Patterns

**URL**: https://speytech.com/ai-architecture/ai-model-serving-patterns/
**Published**: January 13, 2026 20:35
**Topic**: AI Architecture

Understanding latency, throughput, and the trade-offs between them

## The Question Nobody Asks

"How should we serve our models?"

Most teams reach for the first solution that works: wrap the model in Flask, throw it behind a load balancer, call it done. This works until it doesn't. Then someone asks: "Why is P99 latency 5 seconds when P50 is 50ms?" Or: "Why can we only handle 10 requests per second?" Or: "Why did one slow request take down the entire service?"

These aren't model problems. They're architecture problems. The way you structure model serving determines what's possible and what's impossible.

## The Three Fundamental Patterns

Every model serving architecture is a variation on three basic patterns. Each optimizes for different constraints. Each has different failure modes.

### Pattern 1: Single-Process Serving

The simplest pattern: one process loads one model, serves all requests sequentially.

```python
# Single-process server
class ModelServer:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def predict(self, request):
        return self.model.predict(request.data)

# Flask/FastAPI wrapper
app = FastAPI()
server = ModelServer("model.pt")

@app.post("/predict")
def predict_endpoint(request: PredictRequest):
    return server.predict(request)
```

**What this optimizes for:**
- Simplicity (minimal code, easy to reason about)
- Memory efficiency (one model copy in memory)
- Consistent behavior (no parallelism surprises)

**Where this breaks:**
- **Throughput**: Sequential processing caps requests/second
- **Latency**: One slow request blocks all others (head-of-line blocking)
- **Availability**: Process crash = service down

**When to use it:**
- Low traffic

### Pattern 2: Multi-Worker Serving

Run several worker processes behind a process manager (for example, gunicorn with uvicorn workers), each holding its own copy of the model. Requests are spread across workers, so one slow or crashed worker affects only the requests it was handling.

**What this optimizes for:**
- Parallelism on CPU (cheap horizontal scaling on one machine)
- Isolation (a crashed worker is restarted without taking the service down)
- Simplicity relative to batching (no async coordination)

**Where this breaks:**
- **Memory**: N workers × model size must fit in RAM
- **Deployment**: version skew while workers roll over
- **GPU efficiency**: independent workers don't batch work for the GPU

**When to use it:**
- CPU-bound models with moderate traffic
- Latency requirements that rule out batching delay

### Pattern 3: Batch Serving

Accumulate incoming requests for a short window, then run them through the model as a single batch. This is how you keep a GPU busy.

```python
# Batch serving: accumulate requests, run batched inference
class BatchModelServer:
    def __init__(self, model_path, batch_size=32, timeout_ms=10):
        self.model = load_model(model_path)
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.pending = []  # (request_id, input, future) tuples
        # Caller is expected to schedule _batch_timeout_worker() on the event loop

    async def predict(self, request_id, data):
        """Queue a request and await its batched result"""
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request_id, data, future))
        if len(self.pending) >= self.batch_size:
            asyncio.create_task(self._process_batch())
        return await future

    async def _process_batch(self):
        """Process accumulated requests as batch"""
        if not self.pending:
            return

        # Extract batch
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        ids, inputs, futures = zip(*batch)

        # Batch inference
        try:
            batch_input = stack_inputs(inputs)
            batch_output = self.model.predict(batch_input)

            # Distribute results
            for req_id, output, future in zip(ids, batch_output, futures):
                future.set_result(output)
        except Exception as e:
            # Fail all requests in batch
            for future in futures:
                future.set_exception(e)

    async def _batch_timeout_worker(self):
        """Process partial batches after timeout"""
        while True:
            await asyncio.sleep(self.timeout_ms / 1000)
            if self.pending:
                asyncio.create_task(self._process_batch())
```

**What this optimizes for:**
- GPU efficiency (batching maximizes GPU utilization)
- Throughput (batch processing is faster than N individual calls)
- Cost (fewer GPU instances needed for same throughput)

**Where this breaks:**
- **Latency**: Waiting for batch adds delay (batching_delay = timeout_ms)
- **Complexity**: Async code, result distribution, timeout management
- **Fairness**: Large requests can starve small requests

**When to use it:**
- GPU-based
models (batching gives 5-10× throughput improvement) - High throughput requirements (> 1000 req/sec) - Can tolerate added latency (10-100ms batching delay acceptable) - Cost optimization matters (GPU time is expensive) **The latency-throughput trade-off:** Larger batches = higher throughput but higher latency. Smaller batches = lower latency but lower throughput. There's no free lunch. ## The Patterns Combined Production systems often combine patterns: ### Hybrid: Multi-Worker + Batching ```python # Multiple workers, each doing batching # gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app class HybridModelServer: def __init__(self, model_path): self.batch_server = BatchModelServer( model_path, batch_size=32, timeout_ms=10 ) async def predict(self, request): return await self.batch_server.predict( request.id, request.data ) ``` This gives you: - Worker-level isolation (failures contained) - Batch-level GPU efficiency - Scalability (add workers for more capacity) At the cost of: - Maximum complexity - Maximum memory (N workers × model size) - Hardest to debug ## Latency vs Throughput: The Fundamental Trade-off You cannot optimize for both simultaneously. Every architecture choice moves you along this spectrum. **Optimize for latency:** - Single-process serving (no queueing) - Small or no batching (immediate processing) - Overprovisioned capacity (requests never wait) - Multiple model replicas (parallel serving) **Result:** Low latency, high cost, lower throughput. **Optimize for throughput:** - Large batches (maximize GPU utilization) - Request queueing (keep GPU busy) - Underprovisioned capacity (queue absorbs bursts) - Fewer replicas (higher utilization per instance) **Result:** High throughput, lower cost, higher latency. **Production reality:** Most systems need something in between. The right balance depends on your use case, not universal best practices. ## Failure Modes by Pattern Different patterns fail differently. Understanding failure modes helps you pick the right pattern. ### Single-Process Failure Modes **Head-of-line blocking:** One slow request blocks all subsequent requests. ```python # Request 1: 100ms (normal) # Request 2: 10s (adversarial input) # Request 3: 100ms (normal) - but waits 10s for Request 2 ``` **Mitigation:** Request timeouts, input validation, separate queues for different request types. **Process crash:** One bad input crashes the process, service down until restart. **Mitigation:** Process supervision (systemd, k8s), health checks, circuit breakers. See [Production AI Systems: What 30 Years of UNIX Taught Me](/ai-architecture/production-ai-unix-principles/) for proven failure handling patterns. ### Multi-Worker Failure Modes **Memory exhaustion:** Workers compete for memory, OOM killer terminates processes randomly. **Mitigation:** Resource limits per worker, monitoring memory per worker, autoscaling based on memory not just CPU. **Version skew:** During deployment, some workers have new model, some have old model. Requests get inconsistent results. **Mitigation:** Blue-green deployment (switch all at once) or canary with traffic shaping (gradual migration). ### Batch Serving Failure Modes **Batch contamination:** One bad input in batch causes entire batch to fail. ```python # Batch of 32 requests # Request 15 has malformed input # All 32 requests fail ``` **Mitigation:** Input validation before batching, per-request error handling, batch splitting on failure. **Latency variance:** Some requests wait 1ms for batch, others wait 50ms (full timeout). 
P99 latency = timeout duration regardless of actual processing time.

**Mitigation:** Smaller batches (trades throughput for consistency), adaptive batching (adjust size based on traffic), separate batches by priority.

## Choosing the Right Pattern

Ask these questions in order:

**1. What's your traffic volume?**
- Low traffic → Single-process
- Moderate traffic → Multi-worker
- > 1000 req/sec → Consider batching

**2. What's your latency requirement?**
- P99 < 100ms → No batching, multi-worker
- P99 < 1s → Small batches (5-10ms timeout)
- P99 < 5s → Batching acceptable

**3. What hardware are you using?**
- CPU-only → Multi-worker (cheap parallelism)
- GPU → Batching (essential for GPU efficiency)

**4. What's your memory budget?**
- Model × N workers fits in RAM → Multi-worker
- Model × N workers exceeds RAM → Single-process or batching

**5. Can you tolerate complexity?**
- No → Single-process or multi-worker
- Yes → Batching or hybrid

## Production-Ready Implementation

Here's a robust multi-worker serving implementation with proper error handling and observability (see [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/) for what to log and why):

```python
import time
import logging

class ProductionModelServer:
    def __init__(self, model_path, timeout_ms=100):
        self.model = self._load_with_retry(model_path)
        self.timeout_ms = timeout_ms
        self.stats = {"success": 0, "timeout": 0, "error": 0}

    def _load_with_retry(self, model_path, retries=3):
        """Load model with retries and timeout"""
        for attempt in range(retries):
            try:
                logging.info(f"Loading model, attempt {attempt + 1}")
                model = load_model(model_path, timeout=30)
                logging.info("Model loaded successfully")
                return model
            except Exception as e:
                logging.error(f"Load failed: {e}")
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def predict(self, request_id, data):
        """Predict with timeout and comprehensive error handling"""
        start = time.monotonic()

        try:
            # Validate input (counted as an error by the handler below)
            if not self._validate_input(data):
                raise ValueError("Invalid input format")

            # Predict with timeout
            result = self._predict_with_timeout(data)

            latency_ms = (time.monotonic() - start) * 1000
            self.stats["success"] += 1
            logging.info("prediction_success",
                         extra={"request_id": request_id,
                                "latency_ms": latency_ms})
            return result

        except TimeoutError:
            self.stats["timeout"] += 1
            logging.error("prediction_timeout",
                          extra={"request_id": request_id,
                                 "timeout_ms": self.timeout_ms})
            raise

        except Exception as e:
            self.stats["error"] += 1
            logging.error("prediction_error",
                          extra={"request_id": request_id,
                                 "error": str(e)})
            raise

    def _predict_with_timeout(self, data):
        """Run prediction with timeout"""
        # Implementation depends on framework
        # TensorFlow: use timeout in session.run()
        # PyTorch: use signal.alarm() or threading
        return self.model.predict(data)

    def _validate_input(self, data):
        """Validate input before prediction"""
        # Check type, shape, ranges
        return True

    def health_check(self):
        """Health check for load balancer"""
        return {
            "status": "healthy",
            "model_loaded": self.model is not None,
            "stats": self.stats
        }
```

## The Unsexy Truth

The right serving pattern depends on your constraints, not what's popular.

Single-process serving handles more traffic than people think. Multi-worker serving solves most problems without exotic infrastructure. Batching is essential for GPUs but adds complexity.

Start simple. Measure. Add complexity only when you have evidence it's needed. Most serving performance problems aren't architectural - they're resource limits, memory leaks, or inefficient models.
Fix those first before redesigning your architecture.

The best serving architecture is the simplest one that meets your requirements. Everything else is premature optimization.

---

## Production AI Systems: What 30 Years of UNIX Taught Me

**URL**: https://speytech.com/ai-architecture/production-ai-unix-principles/
**Published**: January 13, 2026 19:51
**Topic**: AI Architecture

The infrastructure principles that kept systems running still apply to ML

## The Problem Nobody Talks About

I spent three decades keeping UNIX systems running in production. Banks, telcos, healthcare - places where downtime meant actual consequences. When I started working with ML systems five years ago, I expected them to be different. They're not.

The same infrastructure problems that plagued distributed systems in 1997 are plaguing ML systems in 2026. We've just renamed them and added GPUs. Models fail to load. Predictions time out. Memory leaks crash inference servers. Log files grow until they fill the disk. Race conditions corrupt model state. The exact problems we solved in UNIX systems engineering, now dressed up in Python and TensorFlow.

The difference is that most ML teams don't realize they're building distributed systems. They think they're doing "AI engineering" when they're actually doing systems engineering with models attached.

## What UNIX Got Right

UNIX survived because it had principles, not just implementations. These principles emerged from decades of production experience, refined by failure. They're not sexy. They're not new. But they work.

### Principle 1: Everything Fails

UNIX assumes failure. Processes crash. Disks fill. Networks partition. This isn't pessimism - it's realism.

ML systems often assume success. The model loads. The prediction completes. The GPU is available. When these assumptions break, the system has no plan.

**UNIX approach:** Process supervision (init, systemd). If a daemon crashes, restart it. If it crashes repeatedly, stop trying and alert someone.

**ML equivalent:** Model serving should assume models fail to load, predictions timeout, and GPUs disappear. Every operation needs a failure path.

```python
def serve_prediction(model_id, input_data, timeout_ms=100):
    """Production serving with UNIX-style failure handling"""
    try:
        # Try to load model with timeout
        model = load_model_with_timeout(model_id, timeout_ms=5000)

        # Predict with resource limits
        result = predict_with_limits(
            model,
            input_data,
            memory_mb=512,
            timeout_ms=timeout_ms
        )
        return {"status": "success", "prediction": result}

    except ModelLoadTimeout:
        # Model didn't load in time - use fallback
        log.error("model_load_timeout", model_id=model_id)
        return {"status": "fallback", "reason": "model_load_timeout"}

    except PredictionTimeout:
        # Prediction took too long - fail fast
        log.error("prediction_timeout", model_id=model_id)
        return {"status": "timeout", "latency_ms": timeout_ms}

    except MemoryError:
        # Out of memory - clear and retry once
        gc.collect()
        log.warning("oom_retry", model_id=model_id)
        # Return cached result or fallback
        return get_cached_or_fallback(model_id, input_data)
```

This isn't paranoia. It's the same thinking that made UNIX reliable: assume failure, handle it explicitly, fail gracefully.

### Principle 2: Observability Through Logs

UNIX philosophy: everything important should emit structured logs. Not pretty dashboards. Not ML-specific "observability platforms." Just logs.

Logs survive system crashes. Logs can be grep'd. Logs work when your monitoring infrastructure is down.
Logs are boring, which is why they work. (For a deeper dive on what to log in ML systems specifically, see [The Observability Gap in ML Systems](/ai-architecture/ml-observability-gap/).) ML systems often log the wrong things. Training loss curves. Model accuracy. Hyperparameters. These matter for research. They don't help at 3AM when predictions are failing. **What to log:** ```python # Model loading log.info("model_load_start", model_id=model_id, version=version, size_mb=model_size) # Model loaded successfully log.info("model_load_success", model_id=model_id, load_time_ms=elapsed, memory_mb=memory_used) # Prediction executed log.info("prediction", model_id=model_id, input_hash=hash(input_data), latency_ms=latency, result_hash=hash(result)) # Resource exhaustion approaching log.warning("memory_pressure", used_mb=memory_used, available_mb=memory_available, threshold_pct=80) ``` Notice what's not there: model internals, feature importance, gradient norms. Those belong in training logs. Production logs answer operational questions: Did it work? How long did it take? What resources did it consume? ### Principle 3: Processes Are Cheap, State Is Expensive UNIX makes process creation cheap deliberately. Fork, exec, run, exit. No complex lifecycle management. No shared state between processes unless explicitly designed. ML systems often invert this. They create one long-lived process (a "model server") that holds expensive state (loaded models) and serves many requests. When that process fails, all state is lost. **UNIX pattern applied to ML:** Instead of one process serving all models, run one process per model. When a model crashes, only that model fails. Other models keep serving. ```bash # Traditional approach - one server, all models model_server --models=modelA,modelB,modelC # UNIX approach - separate processes model_server --model=modelA & model_server --model=modelB & model_server --model=modelC & ``` Yes, this uses more memory. Yes, it's "less efficient." But it's more reliable, which matters more in production. Memory is cheap. Debugging cascading failures at 3AM is expensive. ### Principle 4: Text Streams Beat Custom Formats UNIX pipes work because everything speaks text. `ps | grep | awk | sort` - different tools, same interface. ML systems love custom formats. Pickle files. TensorFlow SavedModels. ONNX. Each format needs specific loading code. Each format can break in unique ways. **Better approach:** Use standard formats for everything outside the model itself. - **Inputs:** JSON or Protocol Buffers (parseable, validatable) - **Outputs:** JSON (compatible with every tool) - **Configs:** YAML or TOML (human-readable, version-controllable) - **Logs:** Structured text (grep-able, parseable) When the model serving process crashes, you can still parse its logs, validate its configs, and inspect its inputs. You don't need model-specific tools to understand what happened. ### Principle 5: Small Tools, Composed UNIX provides `cat`, `grep`, `sort`, `uniq` - each does one thing. Composition creates power. ML systems often build monoliths. One codebase for training, serving, monitoring, and retraining. When any part fails, the whole system is suspect. 
**Composition in ML:** ``` # Training - writes model to disk train_model --config=config.yaml --output=model.pt # Serving - loads model from disk serve_model --model=model.pt --port=8000 # Monitoring - reads serving logs monitor_serving --logs=/var/log/serving --alert-threshold=0.95 # Retraining trigger - watches monitoring retrain_trigger --monitor=localhost:8001 --threshold=0.90 ``` Each process does one thing. They communicate through files and logs. When serving breaks, training keeps working. When monitoring breaks, serving keeps working. ## Where ML Systems Are Actually Different UNIX principles apply broadly, but ML systems do have unique characteristics. **Statistical failure modes:** A model can be "working" (no crashes) but producing garbage predictions. Traditional systems don't have this problem - a crashed process is obviously broken. A model serving confidently wrong predictions requires different monitoring. **Resource elasticity:** Models can consume wildly different resources for different inputs. A 10-word sentence and a 10,000-word document hit the same endpoint but need different resources. UNIX services tend to have more predictable resource usage. **Versioning semantics:** Deploying a new model version isn't like deploying new code. The interface stays the same but the behavior changes fundamentally. This requires different rollout strategies than traditional deployments. These differences are real. But they don't invalidate UNIX principles - they extend them. ## Applying This Tomorrow If you're running ML systems in production and they're unreliable, start with UNIX basics: **Week 1:** Add structured logging to model serving. Log every load, every prediction, every failure. Make logs grep-able. **Week 2:** Add process supervision. If your model server crashes, restart it automatically. If it crashes repeatedly, page someone. **Week 3:** Add resource limits. Cap memory usage, timeout predictions, fail fast when resources are exhausted. **Week 4:** Isolate failures. Run risky models in separate processes. Don't let one bad model take down the whole service. None of this is ML-specific. It's just systems engineering. The same principles that kept Solaris servers running in 1997 will keep your ML systems running in 2026. ## The Unsexy Truth Production ML systems fail for boring reasons. Disk full. Memory leak. Network timeout. Configuration typo. Process crash. These are solved problems. UNIX solved them decades ago. We just need to remember the solutions. AI systems are systems first, AI second. Treat them like systems and they'll be more reliable than if you treat them like magic. The principles that kept UNIX running for 30 years still work. They're just not as exciting as talking about transformers and embeddings. But at 3AM when production is down, you'll care a lot more about process supervision than you will about attention mechanisms. Start with the boring infrastructure. Make it reliable. Then add the AI on top. Not the other way around. --- ## The Observability Gap in ML Systems **URL**: https://speytech.com/ai-architecture/ml-observability-gap/ **Published**: January 13, 2026 19:15 **Topic**: AI Architecture Why your model serving cluster fails at 3AM and you can't figure out why ## The 3AM Page The model serving cluster is down. Again. Production traffic is failing. The error message says "Internal Server Error." The logs say nothing useful. You ssh into a pod. CPU looks fine. Memory looks fine. The model loaded successfully an hour ago. 
Predictions were working. Then they stopped. No obvious trigger. No deployment. No config change. Just... stopped.

You restart the pods. Traffic recovers. Problem "solved." You go back to bed.

Three days later, it happens again. Same symptoms. Same non-explanation. Same restart-and-hope fix.

This is the observability gap in ML systems. Traditional monitoring tells you the system is broken. It doesn't tell you *why* the model stopped making predictions.

## What Traditional Observability Misses

Standard monitoring tracks system metrics: CPU, memory, disk, network. These work for stateless services. They fail for ML systems because the interesting failures happen inside the model.

**System says:** "Everything's fine! CPU at 40%, memory at 60%."
**Reality:** The model is returning garbage predictions because input distributions shifted and nobody noticed.

**System says:** "Pod restarted due to OOM kill."
**Reality:** A single adversarial input caused the model to allocate unbounded memory, but you'll never know which input because it's not logged.

**System says:** "P99 latency increased from 50ms to 200ms."
**Reality:** A specific class of inputs takes 10x longer to process, but you don't know which class or why.

Traditional observability gives you symptoms. ML observability requires understanding model behavior, not just process health.

## The Missing Logs

When I started working with ML systems, I asked: "Where are the prediction logs?"

The response: "We log accuracy metrics to MLflow."

That's not observability. That's research tracking. In production, I need to answer different questions:

- Which prediction failed?
- What input caused this latency spike?
- Did the model see this input pattern before?
- When did prediction quality start degrading?
- Which model version served this request?

None of these questions are answerable from accuracy dashboards or system metrics. They require logging what actually happened during serving.

## What to Log (And What Not To)

### Log Every Prediction

Not just errors. Not just slow requests. Every single prediction.

```python
def serve_prediction(model_id, input_data, request_id):
    start = time.monotonic()

    # Compute input hash for deduplication/lookup
    input_hash = hashlib.sha256(
        json.dumps(input_data, sort_keys=True).encode()
    ).hexdigest()[:16]

    try:
        result = model.predict(input_data)
        latency_ms = (time.monotonic() - start) * 1000

        # Log success with context
        log.info("prediction_success",
            request_id=request_id,
            model_id=model_id,
            model_version=model.version,
            input_hash=input_hash,
            input_size=len(str(input_data)),
            output_hash=hashlib.sha256(str(result).encode()).hexdigest()[:16],
            latency_ms=latency_ms,
            timestamp=time.time()
        )
        return result

    except Exception as e:
        latency_ms = (time.monotonic() - start) * 1000

        # Log failure with same context
        log.error("prediction_failure",
            request_id=request_id,
            model_id=model_id,
            model_version=model.version,
            input_hash=input_hash,
            error_type=type(e).__name__,
            error_msg=str(e)[:200],  # Truncate long errors
            latency_ms=latency_ms,
            timestamp=time.time()
        )
        raise
```

This gives you a complete prediction history. When something breaks, you can reconstruct exactly what happened.

### Log Input Characteristics

Don't log the raw input (privacy, storage cost). Log characteristics that help debug.
```python
def log_input_stats(input_data):
    """Log statistical properties of input"""
    if isinstance(input_data, dict):
        stats = {
            "num_fields": len(input_data),
            "field_names": sorted(input_data.keys()),
            "total_size_bytes": len(json.dumps(input_data))
        }
        # For numeric fields, log ranges
        for key, value in input_data.items():
            if isinstance(value, (int, float)):
                stats[f"{key}_value"] = value
    elif isinstance(input_data, list):
        stats = {
            "list_length": len(input_data),
            "item_types": list(set(type(x).__name__ for x in input_data)),
            "total_size_bytes": len(json.dumps(input_data))
        }
    else:
        # Fall back to the type name so stats is always defined
        stats = {"input_type": type(input_data).__name__}

    log.info("input_characteristics", **stats)
```

When you see "P99 latency spiked at 2AM," you can correlate with "inputs with list_length > 1000 started appearing at 2AM."

### Log Model State Changes

Models aren't static. They get loaded, unloaded, swapped, updated. Log every state transition.

```python
class ModelServer:
    def load_model(self, model_id, version):
        start = time.monotonic()
        log.info("model_load_start",
            model_id=model_id,
            version=version,
            timestamp=time.time()
        )

        try:
            model = load_model_from_storage(model_id, version)
            memory_mb = get_model_memory_usage(model)

            log.info("model_load_success",
                model_id=model_id,
                version=version,
                memory_mb=memory_mb,
                load_time_ms=(time.monotonic() - start) * 1000,
                timestamp=time.time()
            )
            return model

        except Exception as e:
            log.error("model_load_failure",
                model_id=model_id,
                version=version,
                error=str(e),
                timestamp=time.time()
            )
            raise
```

When models mysteriously stop working, you can see: "Model B version 2.1 was loaded at 02:47, failures started at 02:48."

## What NOT to Log

**Don't log model internals during serving.** No gradients, no activations, no attention weights. These are expensive to compute and rarely useful for production debugging.

**Don't log PII.** Hash inputs instead of logging raw data. If you need to debug specific inputs, store hashes and retrieve inputs separately with proper access controls.

**Don't log everything to stdout.** Use structured logging (JSON) that can be parsed and indexed. Use log levels appropriately (INFO for normal operations, WARN for degraded states, ERROR for failures).

## The Correlation Problem

Logs are useless if you can't correlate them. Every log entry needs a request ID that spans the entire request lifecycle.

```python
@app.route('/predict', methods=['POST'])
def predict_endpoint():
    # Generate request ID at entry point
    request_id = str(uuid.uuid4())

    log.info("request_start",
        request_id=request_id,
        endpoint="/predict",
        timestamp=time.time()
    )

    try:
        # Pass request_id through entire stack
        result = serve_prediction(
            model_id=request.json['model_id'],
            input_data=request.json['input'],
            request_id=request_id
        )
        log.info("request_success",
            request_id=request_id,
            status_code=200,
            timestamp=time.time()
        )
        return jsonify(result), 200

    except Exception as e:
        log.error("request_failure",
            request_id=request_id,
            error=str(e),
            timestamp=time.time()
        )
        return jsonify({"error": "Internal error"}), 500
```

Now when you see a failure, you can grep for the request_id and see the entire request flow: when it started, which model served it, what the input characteristics were, where it failed.

## Detection vs. Diagnosis

Traditional monitoring detects problems: "Latency increased."

ML observability enables diagnosis: "Latency increased because inputs with >500 tokens started appearing, and those take 10x longer to process."

Detection gets you paged. Diagnosis gets you back to sleep.
**Without proper logging:** - "Model is failing" → restart pods, hope it works - "Latency increased" → scale up, hope it helps - "Accuracy dropped" → no idea when or why **With proper logging:** - "Model is failing" → specific input pattern triggers OOM, add input validation - "Latency increased" → P99 driven by large inputs, add size limits or separate queue - "Accuracy dropped" → distribution shift detected at specific timestamp, trigger retraining ## The Storage Cost Objection "But logging every prediction is expensive!" Yes. It costs money. Know what costs more? Not being able to debug production issues. Not knowing when your model started failing. Not being able to reproduce incidents. ML systems that work at 3AM are worth more than the S3 bill for prediction logs. **Practical cost management:** - Use log sampling for high-volume endpoints (log 1 in 100 for routine requests, 100% for errors) - Compress logs before storage - Use log retention policies (7 days hot, 30 days warm, archive after 90) - Store aggregated statistics rather than every single prediction But start with logging everything. Optimize later when you know what you actually need. ## Observability Enables Everything Else Good observability isn't just for debugging. It enables: **Model monitoring:** You can't detect drift without comparing current inputs to historical inputs. **A/B testing:** You can't measure model improvements without detailed prediction logs. **Incident response:** You can't fix what you can't see. **Compliance:** You can't audit model decisions without prediction history. **Cost optimization:** You can't optimize what you don't measure. Observability is infrastructure. Like networking or storage, you build it once and everything else benefits. ## Start Tomorrow The principles here build on [Production AI Systems: What 30 Years of UNIX Taught Me](/ai-architecture/production-ai-unix-principles/) - observability is just one UNIX principle applied to ML. If your ML serving system has poor observability: **Day 1:** Add structured logging to every prediction. Log input hash, output hash, latency, model version, timestamp. **Day 2:** Add request IDs that span the entire request lifecycle. Now you can correlate logs across services. **Day 3:** Add input characteristic logging. Log sizes, types, statistical properties - not raw data. **Day 4:** Set up log aggregation (ELK, Splunk, CloudWatch Logs - doesn't matter which). Make logs searchable. **Day 5:** Create dashboards that matter: prediction volume over time, latency percentiles by input size, error rates by model version. This isn't ML-specific. It's just observability applied to ML systems. The same logging discipline that kept traditional services debuggable works for models too. ## The Unsexy Truth (Again) Production ML failures are debuggable. But only if you log what matters. The interesting failures aren't visible in CPU graphs or memory charts. They're visible in prediction logs, input characteristics, and model state transitions. Most ML teams discover this the hard way, after the third 3AM page for an issue they can't diagnose. Build observability first. Add ML second. Not the other way around. Then when production breaks at 3AM, you'll actually be able to figure out why. 
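If you adopt the sampling approach described above, the logging call itself can enforce it. A minimal sketch in the structured-logging style used throughout these articles (the `log` object and the 1-in-100 rate are illustrative, not prescriptive):

```python
import random

SAMPLE_RATE = 0.01  # log 1 in 100 routine predictions; always log failures

def log_prediction(log, request_id, model_version, input_hash, latency_ms, error=None):
    """Sampled structured logging: 100% of errors, SAMPLE_RATE of successes."""
    if error is not None:
        log.error("prediction_failure",
                  request_id=request_id,
                  model_version=model_version,
                  input_hash=input_hash,
                  latency_ms=latency_ms,
                  error_type=type(error).__name__)
    elif random.random() < SAMPLE_RATE:
        log.info("prediction_success",
                 request_id=request_id,
                 model_version=model_version,
                 input_hash=input_hash,
                 latency_ms=latency_ms,
                 sampled=True)
```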
--- # Certifiable ML Ecosystem The certifiable-* projects form a complete deterministic ML pipeline: ``` certifiable-data → Data loading, normalization, Feistel shuffling ↓ certifiable-training → Fixed-point gradient descent, Merkle audit trails ↓ certifiable-quant → FP32→Q16.16 quantization with error certificates ↓ certifiable-deploy → Cryptographic model packaging and attestation ↓ certifiable-inference → Bit-identical inference across x86/ARM/RISC-V ↓ certifiable-monitor → Runtime drift detection, COE compliance ↓ certifiable-verify → Pipeline verification, hash-only and full replay Testing: certifiable-harness → End-to-end integration testing certifiable-bench → Performance benchmarking with bit-identity verification ``` All projects use: - Pure C99, no dynamic allocation - Q16.16 fixed-point arithmetic - Round-to-nearest-even (RNE) rounding - SHA-256 hash chains for auditability - Bit-identical results across platforms --- # Contact - Website: https://speytech.com/contact/ - GitHub: https://github.com/SpeyTech - LinkedIn: https://www.linkedin.com/in/william-murray-5180aa32b/ --- *Generated automatically from SpeyTech.com content*