Production ML systems fail for boring reasons. Not exotic model architecture problems — infrastructure problems. Non-deterministic data loading. Floating-point variance across hardware. Training runs that can’t be reproduced. Deployed models that don’t match what was tested.
For most applications, these are annoyances. For safety-critical systems — medical devices, autonomous vehicles, aerospace — they’re disqualifying. You can’t certify software that behaves differently each time it runs.
This article describes a complete ML pipeline designed from first principles for deterministic execution and safety-critical certification. Five open-source projects, 771 tests, and a single invariant: same input → same output, always.
The Problem with Standard ML Pipelines
A typical ML pipeline has non-determinism at every stage:
- Data loading: Shuffle order depends on PRNG state. Multi-threaded loaders return batches in unpredictable order. Floating-point normalisation varies by platform.
- Training: Gradient reduction order affects floating-point accumulation. Dropout masks aren't reproducible. Optimizer state depends on execution order.
- Quantization: Error bounds unknown. Calibration statistics non-reproducible. No proof that the quantized model preserves behaviour.
- Deployment: No binding between the certified model and the deployed artifact. No verification that weights weren't corrupted in transit.
- Inference: Different results on different hardware. Memory allocation timing affects cache state. No audit trail linking a prediction to a model version.
Each of these is a solvable problem. But solving them requires designing the entire pipeline around determinism, not bolting it on afterward.
The Certifiable Pipeline
The solution is five projects that share common principles:
certifiable-data → certifiable-training → certifiable-quant → certifiable-deploy → certifiable-inference

Each project is pure C99, uses zero dynamic allocation, and produces bit-identical output across platforms.
certifiable-data
Problem: Standard data loaders introduce non-determinism through random shuffling, floating-point normalisation, and multi-threaded batch assembly.
Solution: Data loading as a pure function. Given dataset D, seed s, epoch e, and batch index t, the output is deterministic:
B_t = Pipeline(D, s, e, t)

Key components:
- Feistel shuffling: Cryptographic permutation that maps any index to its shuffled position in O(1) time. No sequential state, no execution-order dependence.
- Fixed-point normalisation: Q16.16 arithmetic with precomputed inverse standard deviation. No floating-point division.
- Merkle provenance: Every epoch produces a cryptographic commitment linking dataset hash, configuration, and shuffle seed.
The same (dataset, seed, epoch, batch) produces the same samples, in the same order, with the same normalisation, forever.
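To make the O(1) mapping concrete, here is a minimal sketch of a Feistel-style index permutation with cycle-walking to handle dataset sizes that aren't powers of two. The round function, key schedule, and round count are illustrative assumptions for the sketch, not certifiable-data's actual construction:

```c
#include <stdint.h>

/* Illustrative round function: any fixed integer mixer works here. */
static uint32_t round_fn(uint32_t half, uint32_t key)
{
    half ^= key;
    half *= 0x9E3779B1u;        /* multiply by a Weyl constant */
    return half ^ (half >> 13); /* xor-shift to mix high bits  */
}

/* Smallest even bit-width b with 2^b >= n, so the halves balance. */
static uint32_t domain_bits(uint32_t n)
{
    uint32_t b = 2;
    while (b < 32 && (1u << b) < n) b += 2;
    return b;
}

/* 4-round balanced Feistel permutation over [0, 2^bits). */
static uint32_t feistel(uint32_t x, uint32_t bits, uint32_t seed)
{
    uint32_t half = bits / 2, mask = (1u << half) - 1u;
    uint32_t left = x >> half, right = x & mask;
    for (uint32_t r = 0; r < 4; r++) {
        uint32_t t = left ^ (round_fn(right, seed + r) & mask);
        left  = right;
        right = t;
    }
    return (left << half) | right;
}

/* Map index i (requires i < n) to its shuffled position in [0, n) by
 * cycle-walking: re-permute until the value lands inside the domain.
 * The domain is less than 4n, so the expected walk is a handful of
 * steps, and the result depends only on (i, n, seed), never on
 * execution order or sequential PRNG state. */
uint32_t shuffle_index(uint32_t i, uint32_t n, uint32_t seed)
{
    uint32_t bits = domain_bits(n), x = i;
    do { x = feistel(x, bits, seed); } while (x >= n);
    return x;
}
```

Because any Feistel network is a bijection regardless of its round function, every index maps to a unique position, so the sketch is a true permutation even with a deliberately simple mixer.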
Related reading: The ML Non-Determinism Problem, Bit-Perfect Reproducibility
certifiable-training
Problem: Training loops accumulate floating-point rounding errors non-deterministically. Gradient reduction order affects results. There’s no audit trail proving what happened during training.
Solution: Fixed-point gradients, deterministic reduction topology, and Merkle-chained step verification.
Key components:
- Q8.24 gradients: Higher precision for gradient accumulation, explicit rounding at each operation.
- Compensated summation: Neumaier algorithm tracks rounding error, making large reductions accurate.
- Fixed reduction tree: Binary tree topology eliminates execution-order dependence.
- Merkle chain: Every training step produces a hash linking previous state, gradients, and new weights. Any step is independently verifiable.
The training loop becomes a deterministic state machine. Given initial weights and training data, the final weights are reproducible bit-for-bit.
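As an illustration of the fixed reduction tree, here is a minimal sketch that pairs accumulators in a fixed order with saturating Q8.24 addition. The pairing scheme and saturation policy are assumptions for the sketch; the project's actual rounding and compensation rules live in its specifications:

```c
#include <stdint.h>

typedef int32_t q8_24; /* signed fixed point: 8 integer bits, 24 fractional */

/* Saturating addition: on overflow the sum clamps instead of wrapping,
 * so error saturates rather than accumulating silently. */
static q8_24 q_add_sat(q8_24 a, q8_24 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q8_24)s;
}

/* Reduce n partial gradients (n >= 1, reduced in place) with a fixed
 * binary-tree topology: pair (0,1), (2,3), ... then (0,2), (4,6), ...
 * and so on. The pairing depends only on n, never on thread
 * scheduling, so the rounded result is identical on every run. */
q8_24 tree_reduce(q8_24 *g, uint32_t n)
{
    for (uint32_t stride = 1; stride < n; stride *= 2)
        for (uint32_t i = 0; i + stride < n; i += 2u * stride)
            g[i] = q_add_sat(g[i], g[i + stride]);
    return g[0];
}
```

The same pairing runs whether the reduction executes on one thread or sixteen; parallelism changes timing, not the answer.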
Related reading: From Proofs to Code: Mathematical Transcription in C, Cryptographic Execution Tracing
certifiable-quant
Problem: Standard quantizers (TensorFlow Lite, ONNX) are black boxes. Error bounds unknown. No proof that the quantized model preserves behaviour. Calibration non-reproducible.
Solution: Provable FP32→Q16.16 quantization with formal error certificates.
Key components:
- Theoretical analysis: Compute error bounds before quantization. Overflow proofs, range propagation, Lipschitz constants.
- Empirical calibration: Collect activation statistics with coverage metrics and degenerate detection.
- Verified conversion: Quantize with explicit error tracking, verify against theoretical bounds.
- Cryptographic certificate: Merkle root linking analysis, calibration, and verification digests.
The output is a certificate proving the quantized model is within ε of the original, auditable forever.
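A minimal sketch of the verified-conversion step: quantize a single FP32 weight to Q16.16 with explicit rounding, measure the introduced error, and check it against the theoretical half-LSB bound. The function name and rejection policy are illustrative; the real converter also handles range propagation and per-layer certificates:

```c
#include <stdint.h>
#include <math.h>

typedef int32_t q16_16;

#define Q16_SCALE    65536.0            /* 2^16 */
#define Q16_HALF_LSB (0.5 / Q16_SCALE)  /* bound for round-to-nearest */

/* Quantize one weight; returns 0 on success, -1 on overflow or if the
 * measured error exceeds the proven bound (fail closed, never wrap). */
int quantize_weight(float w, q16_16 *out, double *err)
{
    double scaled = (double)w * Q16_SCALE;  /* exact: scale is a power of 2 */
    if (scaled >= 2147483647.5 || scaled < -2147483648.5)
        return -1;                          /* outside Q16.16 range */
    q16_16 q = (q16_16)floor(scaled + 0.5); /* round half up: no FPU-mode dependence */
    *err = fabs((double)w - (double)q / Q16_SCALE); /* exact in double */
    *out = q;
    return (*err <= Q16_HALF_LSB) ? 0 : -1; /* verify against theoretical bound */
}
```

Running a check like this over every tensor and folding the per-weight errors into layer-level bounds is, in spirit, what the certificate attests to.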
Related reading: Fixed-Point Neural Networks: The Math Behind Q16.16, Closure, Totality, and the Algebra of Safe Systems
certifiable-deploy
Problem: How do you prove the deployed model matches what was certified? How do you verify weights weren’t tampered with? How do you bind artifacts to specific hardware?
Solution: Canonical bundle format with cryptographic attestation and target binding.
Key components:
- CBF v1 format: Deterministic container with no ambient metadata. Same content = same bytes = same hash.
- Merkle attestation: 4-leaf tree binding manifest, weights, certificates, and inference artifacts.
- JCS manifest: RFC 8785 canonical JSON for deterministic serialization.
- Target binding: Lock bundles to specific platforms (arch-vendor-device-abi).
- Runtime loader: Fail-closed state machine. Inference API unreachable without completing verification chain.
The loader implements “Execution ⇒ Verification”: no inference without cryptographic proof that the loaded model matches the certified model.
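A minimal sketch of a fail-closed verification gate, assuming illustrative state names and a boolean result per stage (the real loader recomputes Merkle digests rather than taking flags):

```c
#include <stdint.h>
#include <stdbool.h>

/* States advance strictly in order; any failure latches LD_FAILED,
 * and nothing transitions out of LD_FAILED. */
typedef enum {
    LD_INIT, LD_MANIFEST_OK, LD_WEIGHTS_OK, LD_ATTESTED, LD_READY, LD_FAILED
} loader_state;

typedef struct { loader_state state; } loader;

static void advance(loader *ld, loader_state from, loader_state to, bool ok)
{
    if (ld->state != from || !ok)
        ld->state = LD_FAILED;  /* fail closed: latch, never recover */
    else
        ld->state = to;
}

/* Run the whole chain; each stage's check must pass in sequence. */
bool loader_verify(loader *ld, bool manifest_ok, bool weights_ok, bool attest_ok)
{
    advance(ld, LD_INIT,        LD_MANIFEST_OK, manifest_ok);
    advance(ld, LD_MANIFEST_OK, LD_WEIGHTS_OK,  weights_ok);
    advance(ld, LD_WEIGHTS_OK,  LD_ATTESTED,    attest_ok);
    advance(ld, LD_ATTESTED,    LD_READY,       true);
    return ld->state == LD_READY;
}

/* "Execution => Verification": the inference entry point refuses to
 * run unless the state machine reached LD_READY. */
int infer(const loader *ld, const int32_t *in, int32_t *out)
{
    if (ld->state != LD_READY) return -1; /* unreachable without proof */
    (void)in; (void)out;                  /* forward pass elided */
    return 0;
}
```

Because advance() checks the exact predecessor state, skipping a stage or reordering calls also lands in LD_FAILED.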
Related reading: DO-178C Level A Certification, The Real Cost of Dynamic Memory
certifiable-inference
Problem: Different hardware produces different results. Memory allocation affects timing. No audit trail linking predictions to model versions.
Solution: Integer-only forward pass with static allocation and deterministic execution paths.
Key components:
- Fixed-point arithmetic: Q16.16 for all weights and activations. No floating-point in the critical path.
- Static buffers: All memory provided by caller. No malloc in inference loop.
- Bounded execution: Fixed iteration counts, no data-dependent branching, WCET-analysable.
The same input produces the same output on x86, ARM, and RISC-V. Bit-identical, every time.
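To make the integer-only forward pass concrete, here is a minimal sketch of a Q16.16 dense layer over caller-provided buffers, with fixed trip counts and saturation on the way out. The names and rounding policy are assumptions, and the sketch assumes arithmetic right shift on signed values, which holds on mainstream two's-complement targets:

```c
#include <stdint.h>
#include <stddef.h>

typedef int32_t q16_16;

/* Dense layer y = Wx + b, entirely in integers. Weights, activations,
 * and biases are Q16.16; products are accumulated at Q32.32 in 64
 * bits, then shifted back and saturated. No malloc, no data-dependent
 * branches, fixed iteration counts: WCET-analysable by construction. */
void dense_q16(const q16_16 *w,    /* weights, rows * cols, row-major  */
               const q16_16 *x,    /* input activations, cols          */
               const q16_16 *bias, /* biases, rows                     */
               q16_16 *y,          /* output, rows (caller-allocated)  */
               size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        int64_t acc = (int64_t)bias[r] * 65536;              /* bias to Q32.32 */
        for (size_t c = 0; c < cols; c++)
            acc += (int64_t)w[r * cols + c] * (int64_t)x[c]; /* Q32.32 product */
        int64_t v = acc >> 16;                               /* back to Q16.16 */
        if (v > INT32_MAX) v = INT32_MAX;                    /* saturate, don't wrap */
        if (v < INT32_MIN) v = INT32_MIN;
        y[r] = (q16_16)v;
    }
}
```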
Related reading: WCET for Neural Network Inference, IEC 62304 Class C Requirements
The Three Theorems
Every project in the certifiable family satisfies three properties:
| Theorem | Statement | Implication |
|---|---|---|
| Bit Identity | F_A(s) = F_B(s) for any DVM-compliant platforms A, B | Cross-platform reproducibility |
| Bounded Error | Error saturates, doesn’t accumulate | Predictable behaviour |
| Auditability | Any operation verifiable in O(1) time | Scalable verification |
These aren’t aspirations — they’re tested properties. 771 tests across 39 test suites verify these invariants hold.
What This Enables
Reproducible Research
Training runs that produce identical results, always. No “worked on my machine” failures. No unexplained variance between runs.
Certification Evidence
DO-178C Level A requires complete requirements traceability. IEC 62304 Class C requires validated transformations. ISO 26262 ASIL-D requires provable behaviour. This pipeline provides the evidence these standards demand.
Incident Investigation
When something goes wrong, you can replay exact inputs through exact model versions and get exact outputs. No “we can’t reproduce the failure” mysteries.
Audit Trails
Cryptographic proof linking every prediction to: the deployed bundle, the quantization certificate, the training Merkle chain, and the data provenance. Years later, you can prove exactly how that prediction was made.
Test Coverage
The pipeline is backed by 771 tests across 39 test suites:
| Project | Tests | Suites |
|---|---|---|
| certifiable-data | 133 | 8 |
| certifiable-training | 223 | 10 |
| certifiable-quant | 150 | 7 |
| certifiable-deploy | 201 | 7 |
| certifiable-inference | 64 | 7 |
| Total | 771 | 39 |
These aren’t smoke tests. They verify bit-identical output across platforms, correct handling of edge cases, and conformance to formal specifications.
Getting Started
Each project builds with CMake:
```
git clone https://github.com/williamofai/certifiable-{name}
cd certifiable-{name}
mkdir build && cd build
cmake ..
make
make test
```

Documentation includes mathematical foundations (CT-MATH-001), data structure specifications (CT-STRUCT-001), and Software Requirements Specifications with full traceability.
When You Don’t Need This
This level of rigour has costs: more complex implementation, restricted arithmetic, verbose documentation. If you’re building a recommendation system or a chatbot, standard frameworks are fine.
You need this when:
- Regulatory standards require reproducible behaviour (DO-178C, IEC 62304, ISO 26262)
- Failures have safety implications
- You need to prove what happened, not just log it
- Cross-platform consistency matters
Further Reading
Core Concepts:
- Bit-Perfect Reproducibility: Why It Matters and How to Prove It
- The ML Non-Determinism Problem
- From Proofs to Code: Mathematical Transcription in C
Safety-Critical Standards:
- DO-178C Level A Certification
- IEC 62304 Class C: What Medical Device Software Actually Requires
- ISO 26262 and ASIL-D: The Role of Determinism
Infrastructure Patterns:
- The Real Cost of Dynamic Memory in Safety-Critical Systems
- Cryptographic Execution Tracing and Evidentiary Integrity
- WCET for Neural Network Inference
Context: This pipeline is designed for safety-critical applications with specific regulatory requirements. For most ML applications, standard frameworks provide better tooling, larger ecosystems, and faster development. Evaluate the trade-offs in your specific context.
The goal isn’t complexity for its own sake. It’s providing the evidence that safety-critical certification demands. When a regulator asks “prove this model behaves deterministically,” you need more than good intentions — you need 771 tests and cryptographic audit trails.
All projects are open source under GPL-3.0, with commercial licensing available for proprietary safety-critical systems.