“It works on my machine.”
That phrase has ended more debugging sessions — and started more arguments — than any other in software engineering. For machine learning, it’s worse: “The model trained on my machine.”
Does that mean it will train identically on a cloud server? On your colleague’s laptop? On the production hardware?
Usually, no. But for safety-critical systems, “usually” isn’t good enough. We need “always.”
The Goal
The certifiable-* ecosystem makes a strong claim: run the same training pipeline with the same inputs, and you get the same outputs — bit-for-bit identical — regardless of platform.
Not “close enough.” Not “within floating-point tolerance.” Identical.
Here’s what that looks like in practice:
```json
{
  "platform": "x86_64-linux",
  "stages": [
    {"name": "data", "hash": "2f0c6228001d125032afbe..."},
    {"name": "training", "hash": "36b34d87459ead09c5349d..."},
    {"name": "quant", "hash": "8c78bae645d6f06a3bdd6c..."},
    {"name": "deploy", "hash": "32296bbc342c91ba0c95d1..."},
    {"name": "inference", "hash": "48f4ecebc0eec79ab15fe6..."},
    {"name": "monitor", "hash": "da7f49992d875a6390cb3c..."},
    {"name": "verify", "hash": "33e41fcaaa25c405fbb44f..."}
  ],
  "bit_identical": true
}
```

Same JSON from a Google Cloud Debian VM. Same JSON from an 11-year-old MacBook. The hashes match, byte for byte.
What Makes This Hard
Cross-platform determinism is simple in theory and subtle in practice. Here are the traps:
Floating-Point Inconsistency
IEEE-754 floating point has platform-dependent behaviour:
- x87 FPU uses 80-bit extended precision internally
- Fused multiply-add changes rounding sequences
- Compiler optimisations reorder operations
- `-ffast-math` abandons all guarantees
Even “compliant” implementations can disagree at the last bit.
Solution: Don’t use floating point. Q16.16 fixed-point uses integer arithmetic only. 3 + 5 = 8 on every CPU ever made.
PRNG Divergence
Standard library random functions vary between platforms:
- Different algorithms (LCG vs Mersenne Twister vs xorshift)
- Different seeding behaviour
- Different default states
Solution: Implement your own PRNG. The certifiable-* ecosystem uses a counter-based PRNG that’s a pure function of (seed, operation_id, step). No state, no platform dependencies.
Memory Layout
Struct padding, alignment, and byte order vary:
- sizeof(int) differs between platforms
- Compilers insert padding for alignment
- Big-endian vs little-endian matters for serialisation
Solution: Explicit serialisation with fixed byte order. All data structures use little-endian encoding with no padding. The canonical form is defined in the spec, not left to the compiler.
Library Inconsistency
Even deterministic algorithms can have non-deterministic implementations:
- qsort() is not guaranteed stable, and its ordering of equal elements varies between C libraries
- Hash table iteration order varies
- Floating-point math library functions vary
Solution: Implement core algorithms from scratch. No standard library dependencies for anything that affects determinism.
The Verification Architecture
The certifiable-harness runs all seven pipeline stages and produces a commitment hash for each:
- Data — Load samples, shuffle with Feistel, hash the batch structure
- Training — Forward pass, backward pass, weight updates, Merkle chain
- Quantisation — FP32 to Q16.16 conversion with error certificates
- Deploy — Bundle packaging with manifest and attestation
- Inference — Forward pass on test data, prediction hashes
- Monitor — Drift detection, operational envelope checks
- Verify — Cross-stage binding verification
Each stage produces a deterministic hash. The harness collects them, compares against golden references, and reports pass/fail.
The Test Matrix
We verified bit-identity across:
| Platform | Architecture | OS | Compiler |
|---|---|---|---|
| Google Cloud | x86_64 | Debian 12 | GCC 12 |
| MacBook (2013) | x86_64 | macOS | Clang 15 |
| Raspberry Pi 4 | ARM64 | Ubuntu 22.04 | GCC 11 |
| RISC-V (SiFive) | RV64 | Buildroot | GCC 13 |
All platforms produce identical hashes. Not similar. Identical.
What We Learned
Compiler Flags Matter
Debug builds (-O0) and release builds (-O3) must produce the same results. Our code does. But we had to verify this — some “equivalent” optimisations change floating-point behaviour. Since we don’t use floating point, this wasn’t an issue.
Integer Overflow Is Defined
In C, signed integer overflow is undefined behaviour. Compilers can assume it never happens and optimise accordingly. We use explicit overflow checks with saturation:
```c
int32_t dvm_add(int32_t a, int32_t b, ct_fault_flags_t *faults)
{
    int64_t result = (int64_t)a + (int64_t)b;
    if (result > INT32_MAX) {
        faults->overflow = 1;
        return INT32_MAX; /* Saturate */
    }
    if (result < INT32_MIN) {
        faults->underflow = 1;
        return INT32_MIN; /* Saturate */
    }
    return (int32_t)result;
}
```

Well-defined behaviour on every platform.
Endianness Must Be Explicit
We chose little-endian for all serialisation. ARM and x86 are native little-endian; RISC-V can be either but defaults to little-endian. For big-endian platforms (rare now), we’d need byte-swap routines.
```c
static inline uint32_t to_le32(uint32_t x) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return x;
#else
    return __builtin_bswap32(x);
#endif
}
```

The SHA-256 Implementation Matters
We embed our own SHA-256, derived from a public domain implementation. Why not use OpenSSL? Because OpenSSL versions differ. The algorithm is identical, but linking against different library versions introduces a dependency we can’t control.
Our SHA-256 produces the NIST test vectors:
- SHA256("") = e3b0c44298fc1c14...
- SHA256("abc") = ba7816bf8f01cfea...
If your implementation matches these, it’ll match ours.
The Practical Impact
For Development
When a test fails, you know it’s your code, not platform variance. Debugging is deterministic. Run the same inputs, get the same outputs, every time.
For Certification
DO-178C, IEC 62304, and ISO 26262 all require evidence of reproducibility. “We ran it twice and got the same result” is weak evidence. “Here are the cryptographic hashes from three different platforms, all identical” is strong evidence.
For Incident Investigation
When a deployed model misbehaves, you can reproduce the exact conditions. Not approximately. Exactly. The training data, the hyperparameters, the weight evolution — all reconstructible from the audit trail.
For Collaboration
Share a seed, share the results. Your colleague in another timezone, on different hardware, will get the same outputs. No more “it works on my machine” debates.
The Cost
Determinism isn’t free:
- Development Effort: Implementing from scratch rather than using libraries
- Performance: Some optimisations aren’t available (SIMD with platform-specific rounding)
- Constraints: No floating point, no standard library random, explicit everything
For many ML applications, this cost isn’t justified. For safety-critical applications, it’s the cost of doing business.
Conclusion
Cross-platform bit-identity transforms “we believe this is reproducible” into “we can prove this is reproducible.” The proof is simple: run the pipeline on different platforms, compare the hashes. If they match, the claim is verified.
The certifiable-* ecosystem achieves this through deliberate architectural choices: fixed-point arithmetic, embedded algorithms, explicit serialisation, comprehensive test vectors. None of these choices are exotic. They’re just disciplined.
Seven pipeline stages. Seven hashes. Identical across platforms. That’s the standard for deterministic ML.
As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems where reproducibility must be proven, not assumed, cross-platform bit-identity is the foundation.
Run the verification yourself: clone certifiable-harness, build on your platform, and compare against the golden references.