Standard ML data pipelines are a major source of non-determinism. Floating-point normalisation varies across platforms. Random shuffling produces different orders each run. Data augmentation introduces uncontrolled variation.
When you can’t reproduce your data pipeline, you can’t reproduce your training. When you can’t reproduce your training, you can’t certify your model.
certifiable-data makes data loading a pure function: B_t = Pipeline(D, seed, epoch, t). Given the same dataset, seed, and indices, you get the same batch — bit for bit, every time.
The Problem
Consider a typical PyTorch data loader:
```python
loader = DataLoader(dataset, shuffle=True, num_workers=4)
```

This single line introduces multiple sources of non-determinism:
- Shuffle order depends on PRNG state, which depends on when you called it
- Floating-point normalisation varies by platform
- Worker processes may return batches in different orders
- Augmentation (if any) introduces random transformations
For research, this doesn’t matter. For safety-critical systems, it’s disqualifying.
The Solution
Deterministic Normalisation
Standard normalisation uses floating-point: y = (x - mean) / std. The division introduces platform-dependent rounding.
certifiable-data uses fixed-point with precomputed inverse:
```c
// Q16.16 fixed-point normalisation
// y = (x - μ) * (1/σ)
// All operations use DVM primitives with RNE rounding
int32_t normalise(int32_t x, int32_t mean, int32_t inv_std,
                  ct_fault_flags_t *faults) {
    int64_t diff = (int64_t)x - (int64_t)mean;
    int64_t product = diff * (int64_t)inv_std;
    return dvm_round_shift_rne(product, 16, faults);
}
```

The result is deterministic because:
- All arithmetic is integer (no floating-point)
- Rounding uses Round-to-Nearest-Even (RNE), explicitly
- Overflow is handled by saturation with fault flags
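The rounding and saturation behaviour can be modelled in plain integer arithmetic. This is an illustrative Python sketch of the semantics described above — it is not the library's `dvm_round_shift_rne`, whose exact fault-flag handling is defined in CT-MATH-001:

```python
# Illustrative model of Q16.16 normalisation with round-to-nearest-even.
# Not the library's dvm_round_shift_rne; saturation stands in for fault flags.

Q = 16                                   # Q16.16: 16 fractional bits
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def round_shift_rne(value, shift):
    """Arithmetic right shift with round-to-nearest, ties-to-even."""
    floor_part = value >> shift                # floor division by 2**shift
    remainder = value - (floor_part << shift)  # always in [0, 2**shift)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and (floor_part & 1)):
        floor_part += 1
    return floor_part

def normalise_q16(x, mean, inv_std):
    """y = (x - mean) * inv_std in Q16.16, saturated to the int32 range."""
    y = round_shift_rne((x - mean) * inv_std, Q)
    return max(INT32_MIN, min(INT32_MAX, y))
```

Ties round to the even neighbour: 1.5 in Q16.16 (`3 << 15`) shifts to 2, while 0.5 (`1 << 15`) shifts to 0, so the result never depends on the platform's default rounding mode.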
Feistel Shuffling
Standard shuffling (Fisher-Yates) requires sequential access and maintains internal state. Different execution orders produce different shuffles.
We use a Cycle-Walking Feistel network — a cryptographic permutation that maps any index to its shuffled position in O(1) expected time:
```c
uint32_t permute_index(uint32_t index, uint32_t N,
                       uint64_t seed, uint32_t epoch) {
    // Feistel network with cycle-walking
    // π: [0, N-1] → [0, N-1] (bijection)
    // Same (seed, epoch, index) → same output, always
}
```

Test vectors from CT-MATH-001 §7.2:

```
N=100,   seed=0x123456789ABCDEF0, index=0  → 26
N=100,   seed=0x123456789ABCDEF0, index=99 → 41
N=60000, seed=0xFEDCBA9876543210, index=0  → 26382
```

The permutation is a true bijection — every input maps to exactly one output, and every output comes from exactly one input.
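The structure of such a permutation can be sketched as follows. The round function here is a placeholder, so its outputs will not match the CT-MATH-001 vectors; what it illustrates is the cycle-walking shape: permute inside a power-of-two domain, and re-permute until the result lands back in [0, N). A Feistel network is a bijection for any round function, so the sketch is a true permutation even with a toy mixer:

```python
# Illustrative cycle-walking Feistel permutation. The round function is a
# placeholder (outputs will NOT match CT-MATH-001 §7.2); the structure --
# balanced Feistel plus cycle-walking -- is what the text describes.

MASK64 = (1 << 64) - 1

def _round_fn(half, seed, epoch, r):
    """Placeholder keyed round function mixing (half, seed, epoch, round)."""
    x = (half ^ seed ^ (epoch * 0x9E3779B97F4A7C15) ^ r) & MASK64
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    return (x ^ (x >> 31)) & MASK64

def permute_index(index, n, seed, epoch, rounds=4):
    """Bijection on [0, n): same (seed, epoch, index) -> same output, always."""
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            left, right = right, left ^ (_round_fn(right, seed, epoch, r) & mask)
        x = (left << half_bits) | right
        if x < n:  # cycle-walk: repeat until we land back inside [0, n)
            return x
```

Because the permutation is stateless, any worker can compute any shuffled index independently — there is no shared Fisher–Yates state to race on.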
Deterministic Augmentation
Data augmentation typically uses random number generators. We use counter-based PRNG with explicit operation IDs:
```c
// Horizontal flip (50% probability)
uint64_t rng = ct_prng(seed, epoch, sample_idx << 16 | OP_FLIP);
bool flip = (rng & 1);

// Random crop
uint32_t crop_x = ct_prng_uniform(seed, epoch,
                                  sample_idx << 16 | OP_CROP_X,
                                  max_x + 1);
```

Same (seed, epoch, sample_idx) produces the same augmentation. No hidden state.
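A counter-based PRNG of this kind can be sketched with a splitmix64-style finalizer. The mixing constants, the `OP_*` IDs, and the `_sketch` names below are illustrative assumptions, not the library's actual `ct_prng`:

```python
# Illustrative counter-based PRNG: a pure function of (seed, epoch, op_id)
# with no hidden state. Splitmix64-style mixing; the OP_* constants and the
# exact mixer are assumptions, not the library's ct_prng.

MASK64 = (1 << 64) - 1
OP_FLIP, OP_CROP_X = 0x01, 0x02   # hypothetical operation IDs

def ct_prng_sketch(seed, epoch, op_id):
    x = (seed ^ (epoch * 0x9E3779B97F4A7C15)
              ^ (op_id * 0xD1B54A32D192ED03)) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def ct_prng_uniform_sketch(seed, epoch, op_id, bound):
    """Map to [0, bound); plain modulo here -- a real library may debias."""
    return ct_prng_sketch(seed, epoch, op_id) % bound

# Same inputs, same decision -- augmentation is replayable per sample:
sample_idx = 7
flip = ct_prng_sketch(0xDEADBEEF, 3, (sample_idx << 16) | OP_FLIP) & 1
```

Encoding the operation ID into the counter is what keeps independent decisions (flip vs. crop) for the same sample statistically decoupled while staying fully deterministic.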
Merkle Provenance
Every epoch produces a cryptographic commitment:
```
h_0 = SHA256(0x03 || H_dataset || H_config || seed)
h_e = SHA256(0x04 || h_{e-1} || H_epoch || e)
```

You can prove exactly what data was used for training. Any tampering breaks the chain.
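The two formulas translate directly to code. In this sketch, `H_dataset`, `H_config`, and `H_epoch` are assumed to be 32-byte SHA-256 digests, and the 8-byte seed and 4-byte epoch encodings are assumptions rather than the library's wire format:

```python
import hashlib

# Sketch of the provenance chain. Byte widths for seed (8) and epoch (4)
# are assumptions; the 0x03/0x04 domain-separation tags follow the text.

def chain_init(h_dataset, h_config, seed):
    # h_0 = SHA256(0x03 || H_dataset || H_config || seed)
    return hashlib.sha256(b"\x03" + h_dataset + h_config
                          + seed.to_bytes(8, "big")).digest()

def chain_step(h_prev, h_epoch, epoch):
    # h_e = SHA256(0x04 || h_{e-1} || H_epoch || e)
    return hashlib.sha256(b"\x04" + h_prev + h_epoch
                          + epoch.to_bytes(4, "big")).digest()
```

An auditor recomputes the chain from the dataset and config hashes and compares the final digest against the recorded one; a mismatch at epoch e pinpoints where the data diverged.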
What’s Implemented
All core modules complete — 142 tests passing across 8 test suites:
| Module | Tests | Coverage |
|---|---|---|
| DVM Primitives | 38 | CT-MATH-001 §3 test vectors |
| PRNG | 13 | Determinism, distribution quality |
| Shuffle | 19 | Bijection verification |
| Normalise | 13 | Correctness, overflow handling |
| Augment | 10 | Flip, crop, noise |
| Batch | 12 | Construction, verification |
| Merkle | 20 | Hashing, provenance chain |
| Bit-Identity | 17 | Cross-platform verification |
Usage Example
```c
#include "ct_types.h"
#include "loader.h"
#include "normalize.h"
#include "shuffle.h"
#include "batch.h"
#include "merkle.h"

// Pre-allocated buffers
ct_sample_t dataset_samples[60000];
ct_dataset_t dataset = {
    .samples = dataset_samples,
    .num_samples = 60000
};
ct_fault_flags_t faults = {0};

// Load data (deterministic decimal parsing)
ct_load_csv("mnist.csv", &dataset, &faults);

// Setup normalisation
ct_normalize_ctx_t norm_ctx;
ct_normalize_init(&norm_ctx, means, inv_stds, 784);

// Create batch via deterministic shuffle
ct_batch_t batch;
ct_batch_init(&batch, batch_samples, batch_hashes, 32);
ct_batch_fill(&batch, &dataset, batch_index, epoch, seed);

// Verify integrity
int valid = ct_batch_verify(&batch);

// Initialize provenance chain
ct_provenance_t prov;
ct_provenance_init(&prov, dataset_hash, config_hash, seed);
```

The Complete Pipeline
certifiable-data completes the deterministic ML pipeline:
```
certifiable-data → certifiable-training → certifiable-inference
       ↓                    ↓                      ↓
   Load data           Train model           Deploy model
   Normalise           Merkle chain          Bit-perfect
   Shuffle             Audit trail           inference
   Batch
```

Every step is deterministic. Every step is auditable. The same seed produces the same model produces the same predictions, forever.
Why It Matters
Reproducibility Crisis
The ML reproducibility crisis is well-documented. Papers can’t be replicated. Models can’t be reconstructed. Part of the problem is non-deterministic data pipelines — you can’t reproduce training if you can’t reproduce the exact data order.
Certification Requirements
IEC 62304 Class C (medical devices) requires traceable software. DO-178C Level A (aerospace) requires complete requirements traceability. “We shuffled the data randomly” satisfies neither.
With certifiable-data, you can prove:
- Exactly what data was used
- In exactly what order
- With exactly what transformations
- And verify it cryptographically
Debugging Training
When training fails or produces unexpected results, you need to understand what happened. With deterministic data loading, you can replay exact batches and inspect exact transformations. No “it worked differently yesterday” mysteries.
Getting Started
```
git clone https://github.com/williamofai/certifiable-data
cd certifiable-data
mkdir build && cd build
cmake ..
make
make test
```

Expected output:

```
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 0.04 sec
```

Documentation
- CT-MATH-001.md — Mathematical foundations (normalisation, Feistel, Merkle)
- CT-STRUCT-001.md — Data structure specifications
- docs/requirements/ — SRS documents with full traceability
Data loading as a pure function. Merkle-proven provenance. GPL-3.0 licensed.