Production ML systems fail for boring reasons. Not exotic model architecture problems — infrastructure problems. Non-deterministic data loading. Floating-point variance across hardware. Training runs that can’t be reproduced. Deployed models that don’t match what was tested.
For most applications, these are annoyances. For safety-critical systems — medical devices, autonomous vehicles, aerospace — they’re disqualifying. You can’t certify software that behaves differently each time it runs.
This article describes a complete ML pipeline designed from first principles for deterministic execution and safety-critical certification. Five open-source projects, 771 tests, and a single invariant: same input → same output, always.
The Problem with Standard ML Pipelines
A typical ML pipeline has non-determinism at every stage:
- Data loading: Shuffle order depends on PRNG state. Multi-threaded loaders return batches in unpredictable order. Floating-point normalisation varies by platform.
- Training: Gradient reduction order affects floating-point accumulation. Dropout masks aren't reproducible. Optimizer state depends on execution order.
- Quantization: Error bounds unknown. Calibration statistics non-reproducible. No proof that the quantized model preserves behaviour.
- Deployment: No binding between the certified model and the deployed artifact. No verification that weights weren't corrupted in transit.
- Inference: Different results on different hardware. Memory allocation timing affects cache state. No audit trail linking a prediction to a model version.
Each of these is a solvable problem. But solving them requires designing the entire pipeline around determinism, not bolting it on afterward.
The Certifiable Pipeline
The solution is five projects that share common principles:
certifiable-data → certifiable-training → certifiable-quant → certifiable-deploy → certifiable-inference

Each project is pure C99, uses zero dynamic allocation, and produces bit-identical output across platforms.
certifiable-data
Problem: Standard data loaders introduce non-determinism through random shuffling, floating-point normalisation, and multi-threaded batch assembly.
Solution: Data loading as a pure function. Given dataset D, seed s, epoch e, and batch index t, the output is deterministic:
B_t = Pipeline(D, s, e, t)

Key components:
- Feistel shuffling: Cryptographic permutation that maps any index to its shuffled position in O(1) time. No sequential state, no execution-order dependence.
- Fixed-point normalisation: Q16.16 arithmetic with precomputed inverse standard deviation. No floating-point division.
- Merkle provenance: Every epoch produces a cryptographic commitment linking dataset hash, configuration, and shuffle seed.
The same (dataset, seed, epoch, batch) produces the same samples, in the same order, with the same normalisation, forever.
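To make the O(1) mapping concrete, here is a minimal sketch of a Feistel-style index permutation with cycle-walking to handle dataset sizes that aren't powers of two. The round function, key schedule, and round count are illustrative assumptions for the sketch, not certifiable-data's actual construction:

```c
#include <stdint.h>

/* Illustrative round function: any fixed integer mixer works here. */
static uint32_t round_fn(uint32_t half, uint32_t key)
{
    half ^= key;
    half *= 0x9E3779B1u;        /* multiply by a Weyl constant */
    return half ^ (half >> 13); /* xor-shift to mix high bits  */
}

/* Smallest even bit-width b with 2^b >= n, so the halves balance. */
static uint32_t domain_bits(uint32_t n)
{
    uint32_t b = 2;
    while (b < 32 && (1u << b) < n) b += 2;
    return b;
}

/* 4-round balanced Feistel permutation over [0, 2^bits). */
static uint32_t feistel(uint32_t x, uint32_t bits, uint32_t seed)
{
    uint32_t half = bits / 2, mask = (1u << half) - 1u;
    uint32_t left = x >> half, right = x & mask;
    for (uint32_t r = 0; r < 4; r++) {
        uint32_t t = left ^ (round_fn(right, seed + r) & mask);
        left  = right;
        right = t;
    }
    return (left << half) | right;
}

/* Map index i (requires i < n) to its shuffled position in [0, n) by
 * cycle-walking: re-permute until the value lands inside the domain.
 * The domain is less than 4n, so the expected walk is a handful of
 * steps, and the result depends only on (i, n, seed), never on
 * execution order or sequential PRNG state. */
uint32_t shuffle_index(uint32_t i, uint32_t n, uint32_t seed)
{
    uint32_t bits = domain_bits(n), x = i;
    do { x = feistel(x, bits, seed); } while (x >= n);
    return x;
}
```

Because any Feistel network is a bijection regardless of its round function, every index maps to a unique position, so the sketch is a true permutation even with a deliberately simple mixer.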
Related reading: The ML Non-Determinism Problem, Bit-Perfect Reproducibility
certifiable-training
Problem: Training loops accumulate floating-point rounding errors non-deterministically. Gradient reduction order affects results. There’s no audit trail proving what happened during training.
Solution: Fixed-point gradients, deterministic reduction topology, and Merkle-chained step verification.
Key components:
- Q8.24 gradients: Higher precision for gradient accumulation, explicit rounding at each operation.
- Compensated summation: Neumaier algorithm tracks rounding error, making large reductions accurate.
- Fixed reduction tree: Binary tree topology eliminates execution-order dependence.
- Merkle chain: Every training step produces a hash linking previous state, gradients, and new weights. Any step is independently verifiable.
The training loop becomes a deterministic state machine. Given initial weights and training data, the final weights are reproducible bit-for-bit.
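As an illustration of the fixed reduction tree, here is a minimal sketch that pairs accumulators in a fixed order with saturating Q8.24 addition. The pairing scheme and saturation policy are assumptions for the sketch; the project's actual rounding and compensation rules live in its specifications:

```c
#include <stdint.h>

typedef int32_t q8_24; /* signed fixed point: 8 integer bits, 24 fractional */

/* Saturating addition: on overflow the sum clamps instead of wrapping,
 * so error saturates rather than accumulating silently. */
static q8_24 q_add_sat(q8_24 a, q8_24 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q8_24)s;
}

/* Reduce n partial gradients (n >= 1, reduced in place) with a fixed
 * binary-tree topology: pair (0,1), (2,3), ... then (0,2), (4,6), ...
 * and so on. The pairing depends only on n, never on thread
 * scheduling, so the rounded result is identical on every run. */
q8_24 tree_reduce(q8_24 *g, uint32_t n)
{
    for (uint32_t stride = 1; stride < n; stride *= 2)
        for (uint32_t i = 0; i + stride < n; i += 2u * stride)
            g[i] = q_add_sat(g[i], g[i + stride]);
    return g[0];
}
```

The same pairing runs whether the reduction executes on one thread or sixteen; parallelism changes timing, not the answer.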
Related reading: From Proofs to Code: Mathematical Transcription in C, Cryptographic Execution Tracing
certifiable-quant
Problem: Standard quantizers (TensorFlow Lite, ONNX) are black boxes. Error bounds unknown. No proof that the quantized model preserves behaviour. Calibration non-reproducible.
Solution: Provable FP32→Q16.16 quantization with formal error certificates.
Key components:
- Theoretical analysis: Compute error bounds before quantization. Overflow proofs, range propagation, Lipschitz constants.
- Empirical calibration: Collect activation statistics with coverage metrics and degenerate detection.
- Verified conversion: Quantize with explicit error tracking, verify against theoretical bounds.
- Cryptographic certificate: Merkle root linking analysis, calibration, and verification digests.
The output is a certificate proving the quantized model is within ε of the original, auditable forever.
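A minimal sketch of the verified-conversion step: quantize a single FP32 weight to Q16.16 with explicit rounding, measure the introduced error, and check it against the theoretical half-LSB bound. The function name and rejection policy are illustrative; the real converter also handles range propagation and per-layer certificates:

```c
#include <stdint.h>
#include <math.h>

typedef int32_t q16_16;

#define Q16_SCALE    65536.0            /* 2^16 */
#define Q16_HALF_LSB (0.5 / Q16_SCALE)  /* bound for round-to-nearest */

/* Quantize one weight; returns 0 on success, -1 on overflow or if the
 * measured error exceeds the proven bound (fail closed, never wrap). */
int quantize_weight(float w, q16_16 *out, double *err)
{
    double scaled = (double)w * Q16_SCALE;  /* exact: scale is a power of 2 */
    if (scaled >= 2147483647.5 || scaled < -2147483648.5)
        return -1;                          /* outside Q16.16 range */
    q16_16 q = (q16_16)floor(scaled + 0.5); /* round half up: no FPU-mode dependence */
    *err = fabs((double)w - (double)q / Q16_SCALE); /* exact in double */
    *out = q;
    return (*err <= Q16_HALF_LSB) ? 0 : -1; /* verify against theoretical bound */
}
```

Running a check like this over every tensor and folding the per-weight errors into layer-level bounds is, in spirit, what the certificate attests to.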
Related reading: Fixed-Point Neural Networks: The Math Behind Q16.16, Closure, Totality, and the Algebra of Safe Systems
certifiable-deploy
Problem: How do you prove the deployed model matches what was certified? How do you verify weights weren’t tampered with? How do you bind artifacts to specific hardware?
Solution: Canonical bundle format with cryptographic attestation and target binding.
Key components:
- CBF v1 format: Deterministic container with no ambient metadata. Same content = same bytes = same hash.
- Merkle attestation: 4-leaf tree binding manifest, weights, certificates, and inference artifacts.
- JCS manifest: RFC 8785 canonical JSON for deterministic serialization.
- Target binding: Lock bundles to specific platforms (arch-vendor-device-abi).
- Runtime loader: Fail-closed state machine. Inference API unreachable without completing verification chain.
The loader implements “Execution ⇒ Verification”: no inference without cryptographic proof that the loaded model matches the certified model.
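A minimal sketch of a fail-closed verification gate, assuming illustrative state names and a boolean result per stage (the real loader recomputes Merkle digests rather than taking flags):

```c
#include <stdint.h>
#include <stdbool.h>

/* States advance strictly in order; any failure latches LD_FAILED,
 * and nothing transitions out of LD_FAILED. */
typedef enum {
    LD_INIT, LD_MANIFEST_OK, LD_WEIGHTS_OK, LD_ATTESTED, LD_READY, LD_FAILED
} loader_state;

typedef struct { loader_state state; } loader;

static void advance(loader *ld, loader_state from, loader_state to, bool ok)
{
    if (ld->state != from || !ok)
        ld->state = LD_FAILED;  /* fail closed: latch, never recover */
    else
        ld->state = to;
}

/* Run the whole chain; each stage's check must pass in sequence. */
bool loader_verify(loader *ld, bool manifest_ok, bool weights_ok, bool attest_ok)
{
    advance(ld, LD_INIT,        LD_MANIFEST_OK, manifest_ok);
    advance(ld, LD_MANIFEST_OK, LD_WEIGHTS_OK,  weights_ok);
    advance(ld, LD_WEIGHTS_OK,  LD_ATTESTED,    attest_ok);
    advance(ld, LD_ATTESTED,    LD_READY,       true);
    return ld->state == LD_READY;
}

/* "Execution => Verification": the inference entry point refuses to
 * run unless the state machine reached LD_READY. */
int infer(const loader *ld, const int32_t *in, int32_t *out)
{
    if (ld->state != LD_READY) return -1; /* unreachable without proof */
    (void)in; (void)out;                  /* forward pass elided */
    return 0;
}
```

Because advance() checks the exact predecessor state, skipping a stage or reordering calls also lands in LD_FAILED.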
Related reading: DO-178C Level A Certification, The Real Cost of Dynamic Memory
certifiable-inference
Problem: Different hardware produces different results. Memory allocation affects timing. No audit trail linking predictions to model versions.
Solution: Integer-only forward pass with static allocation and deterministic execution paths.
Key components:
- Fixed-point arithmetic: Q16.16 for all weights and activations. No floating-point in the critical path.
- Static buffers: All memory provided by caller. No malloc in inference loop.
- Bounded execution: Fixed iteration counts, no data-dependent branching, WCET-analysable.
The same input produces the same output on x86, ARM, and RISC-V. Bit-identical, every time.
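To make the integer-only forward pass concrete, here is a minimal sketch of a Q16.16 dense layer over caller-provided buffers, with fixed trip counts and saturation on the way out. The names and rounding policy are assumptions, and the sketch assumes arithmetic right shift on signed values, which holds on mainstream two's-complement targets:

```c
#include <stdint.h>
#include <stddef.h>

typedef int32_t q16_16;

/* Dense layer y = Wx + b, entirely in integers. Weights, activations,
 * and biases are Q16.16; products are accumulated at Q32.32 in 64
 * bits, then shifted back and saturated. No malloc, no data-dependent
 * branches, fixed iteration counts: WCET-analysable by construction. */
void dense_q16(const q16_16 *w,    /* weights, rows * cols, row-major  */
               const q16_16 *x,    /* input activations, cols          */
               const q16_16 *bias, /* biases, rows                     */
               q16_16 *y,          /* output, rows (caller-allocated)  */
               size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        int64_t acc = (int64_t)bias[r] * 65536;              /* bias to Q32.32 */
        for (size_t c = 0; c < cols; c++)
            acc += (int64_t)w[r * cols + c] * (int64_t)x[c]; /* Q32.32 product */
        int64_t v = acc >> 16;                               /* back to Q16.16 */
        if (v > INT32_MAX) v = INT32_MAX;                    /* saturate, don't wrap */
        if (v < INT32_MIN) v = INT32_MIN;
        y[r] = (q16_16)v;
    }
}
```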
Related reading: WCET for Neural Network Inference, IEC 62304 Class C Requirements
The Three Theorems
Every project in the certifiable family satisfies three properties:
| Theorem | Statement | Implication |
|---|---|---|
| Bit Identity | F_A(s) = F_B(s) for any DVM-compliant platforms A, B | Cross-platform reproducibility |
| Bounded Error | Error saturates, doesn’t accumulate | Predictable behaviour |
| Auditability | Any operation verifiable in O(1) time | Scalable verification |
These aren’t aspirations — they’re tested properties. 771 tests across 39 test suites verify these invariants hold.
What This Enables
Reproducible Research
Training runs that produce identical results, always. No “worked on my machine” failures. No unexplained variance between runs.
Certification Evidence
DO-178C Level A requires complete requirements traceability. IEC 62304 Class C requires validated transformations. ISO 26262 ASIL-D requires provable behaviour. This pipeline provides the evidence these standards demand.
Incident Investigation
When something goes wrong, you can replay exact inputs through exact model versions and get exact outputs. No “we can’t reproduce the failure” mysteries.
Audit Trails
Cryptographic proof linking every prediction to: the deployed bundle, the quantization certificate, the training Merkle chain, and the data provenance. Years later, you can prove exactly how that prediction was made.
Test Coverage
The pipeline is backed by 771 tests across 39 test suites:
| Project | Tests | Suites |
|---|---|---|
| certifiable-data | 133 | 8 |
| certifiable-training | 223 | 10 |
| certifiable-quant | 150 | 7 |
| certifiable-deploy | 201 | 7 |
| certifiable-inference | 64 | 7 |
| Total | 771 | 39 |
These aren’t smoke tests. They verify bit-identical output across platforms, correct handling of edge cases, and conformance to formal specifications.
Getting Started
Each project builds with CMake:
```
git clone https://github.com/williamofai/certifiable-{name}
cd certifiable-{name}
mkdir build && cd build
cmake ..
make
make test
```

Documentation includes mathematical foundations (CT-MATH-001), data structure specifications (CT-STRUCT-001), and Software Requirements Specifications with full traceability.
When You Don’t Need This
This level of rigour has costs: more complex implementation, restricted arithmetic, verbose documentation. If you’re building a recommendation system or a chatbot, standard frameworks are fine.
You need this when:
- Regulatory standards require reproducible behaviour (DO-178C, IEC 62304, ISO 26262)
- Failures have safety implications
- You need to prove what happened, not just log it
- Cross-platform consistency matters
Further Reading
Core Concepts:
- Bit-Perfect Reproducibility: Why It Matters and How to Prove It
- The ML Non-Determinism Problem
- From Proofs to Code: Mathematical Transcription in C
Safety-Critical Standards:
- DO-178C Level A Certification
- IEC 62304 Class C: What Medical Device Software Actually Requires
- ISO 26262 and ASIL-D: The Role of Determinism
Infrastructure Patterns:
- The Real Cost of Dynamic Memory in Safety-Critical Systems
- Cryptographic Execution Tracing and Evidentiary Integrity
- WCET for Neural Network Inference
Context: This pipeline is designed for safety-critical applications with specific regulatory requirements. For most ML applications, standard frameworks provide better tooling, larger ecosystems, and faster development. Evaluate the trade-offs in your specific context.
The goal isn’t complexity for its own sake. It’s providing the evidence that safety-critical certification demands. When a regulator asks “prove this model behaves deterministically,” you need more than good intentions — you need 771 tests and cryptographic audit trails.
All projects are open source under GPL-3.0, with commercial licensing available for proprietary safety-critical systems.