
Certifiable Data

Deterministic data pipelines for safety-critical ML — because 'we shuffled the data' isn't reproducible

Published: January 16, 2026 · Reading time: 3 min
[Figure: data pipeline showing deterministic loading, normalisation, shuffling, and batching with Merkle provenance]

Standard ML data pipelines are a major source of non-determinism. Floating-point normalisation varies across platforms. Random shuffling produces different orders each run. Data augmentation introduces uncontrolled variation.

When you can’t reproduce your data pipeline, you can’t reproduce your training. When you can’t reproduce your training, you can’t certify your model.

certifiable-data makes data loading a pure function: B_t = Pipeline(D, seed, epoch, t). Given the same dataset, seed, and indices, you get the same batch — bit for bit, every time.

The Problem

Consider a typical PyTorch data loader:

loader = DataLoader(dataset, shuffle=True, num_workers=4)

This single line introduces multiple sources of non-determinism:

  1. Shuffle order depends on global PRNG state, which in turn depends on every prior call to the RNG
  2. Floating-point normalisation varies by platform
  3. Worker processes may return batches in different orders
  4. Augmentation (if any) introduces random transformations

For research, this doesn’t matter. For safety-critical systems, it’s disqualifying.

The Solution

Deterministic Normalisation

Standard normalisation uses floating-point: y = (x - mean) / std. The division introduces platform-dependent rounding.

certifiable-data uses fixed-point with precomputed inverse:

// Q16.16 fixed-point normalisation
// y = (x - μ) * (1/σ)
// All operations use DVM primitives with RNE rounding

int32_t normalise(int32_t x, int32_t mean, int32_t inv_std, 
                  ct_fault_flags_t *faults) {
    int64_t diff = (int64_t)x - (int64_t)mean;
    int64_t product = diff * (int64_t)inv_std;
    return dvm_round_shift_rne(product, 16, faults);
}

The result is deterministic because:

  • All arithmetic is integer (no floating-point)
  • Rounding uses Round-to-Nearest-Even (RNE), explicitly
  • Overflow is handled by saturation with fault flags
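The `dvm_round_shift_rne` primitive itself isn't shown above. A minimal sketch of its semantics — round-to-nearest-even right shift with int32 saturation — might look like the following; `round_shift_rne` and `fault_flags_t` are illustrative names for this sketch, not the DVM API:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative fault-flag struct; the real ct_fault_flags_t is defined
 * in CT-STRUCT-001 and carries more than an overflow bit. */
typedef struct { bool overflow; } fault_flags_t;

/* Round-to-nearest-even right shift with int32 saturation.
 * Assumes arithmetic right shift of negative values (two's complement),
 * which every mainstream compiler provides. */
static int32_t round_shift_rne(int64_t v, unsigned shift, fault_flags_t *f)
{
    int64_t q    = v >> shift;                       /* floor(v / 2^shift) */
    int64_t rem  = v - q * ((int64_t)1 << shift);    /* remainder in [0, 2^shift) */
    int64_t half = (int64_t)1 << (shift - 1);

    /* round up when above the midpoint, or exactly on it with q odd */
    if (rem > half || (rem == half && (q & 1)))
        q += 1;

    if (q > INT32_MAX) { f->overflow = true; return INT32_MAX; }
    if (q < INT32_MIN) { f->overflow = true; return INT32_MIN; }
    return (int32_t)q;
}
```

With Q16.16 inputs, 0x18000 (1.5) rounds to 2 and 0x28000 (2.5) also rounds to 2 — ties go to the even neighbour, on every platform.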

Feistel Shuffling

Standard shuffling (Fisher-Yates) requires sequential access and maintains internal state. Different execution orders produce different shuffles.

We use a Cycle-Walking Feistel network — a cryptographic permutation that maps any index to its shuffled position in O(1) time:

uint32_t permute_index(uint32_t index, uint32_t N, 
                       uint64_t seed, uint32_t epoch) {
    // Feistel network with cycle-walking
    // π: [0, N-1] → [0, N-1] (bijection)
    // Same (seed, epoch, index) → same output, always
}

Test vectors from CT-MATH-001 §7.2:

N=100, seed=0x123456789ABCDEF0, index=0 → 26
N=100, seed=0x123456789ABCDEF0, index=99 → 41
N=60000, seed=0xFEDCBA9876543210, index=0 → 26382

The permutation is a true bijection — every input maps to exactly one output, and every output comes from exactly one input.
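A cycle-walking Feistel can be sketched as follows. This is an illustrative construction, not the CT-MATH-001 specification — its round function and key schedule are placeholders, so it will not reproduce the test vectors above — but it shows the mechanism: permute within a power-of-two domain covering N, then re-encrypt any out-of-range result until it lands back inside [0, N):

```c
#include <stdint.h>

/* Placeholder round function: any deterministic mix of (half, key, round)
 * works; this one is a SplitMix64-style finaliser, not the real one. */
static uint32_t feistel_f(uint32_t half, uint64_t key, uint32_t round)
{
    uint64_t x = half ^ key ^ ((uint64_t)round * 0x9E3779B97F4A7C15ULL);
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27;
    return (uint32_t)x;
}

/* Cycle-walking Feistel permutation over [0, N).
 * half_bits is chosen so the Feistel domain 2^(2*half_bits) covers N;
 * the do/while re-encrypts out-of-range outputs back into [0, N). */
uint32_t permute_index_sketch(uint32_t index, uint32_t N,
                              uint64_t seed, uint32_t epoch)
{
    unsigned half_bits = 1;
    while ((1ULL << (2 * half_bits)) < N) half_bits++;
    uint32_t mask = (1u << half_bits) - 1;
    uint64_t key  = seed ^ ((uint64_t)epoch << 32);

    uint64_t x = index;
    do {
        uint32_t left  = (uint32_t)(x >> half_bits);
        uint32_t right = (uint32_t)(x & mask);
        for (uint32_t r = 0; r < 4; r++) {          /* 4 Feistel rounds */
            uint32_t tmp = right;
            right = left ^ (feistel_f(right, key, r) & mask);
            left  = tmp;
        }
        x = ((uint64_t)left << half_bits) | right;
    } while (x >= N);                               /* cycle-walk into range */
    return (uint32_t)x;
}
```

Bijectivity comes for free: a balanced Feistel network is a permutation of its power-of-two domain, and cycle-walking restricts that permutation to [0, N) without breaking it. The walk always terminates because the cycle containing the input must revisit a value below N — the input itself, if nothing else.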

Deterministic Augmentation

Data augmentation typically uses random number generators. We use counter-based PRNG with explicit operation IDs:

// Horizontal flip (50% probability)
uint64_t rng = ct_prng(seed, epoch, sample_idx << 16 | OP_FLIP);
bool flip = (rng & 1);

// Random crop
uint32_t crop_x = ct_prng_uniform(seed, epoch, 
                                   sample_idx << 16 | OP_CROP_X, 
                                   max_x + 1);

Same (seed, epoch, sample_idx) produces same augmentation. No hidden state.
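A counter-based PRNG of this shape can be sketched with a SplitMix64-style finaliser. `prng_sketch` and `prng_uniform_sketch` are hypothetical stand-ins for `ct_prng` and `ct_prng_uniform` — the library's actual mixing function may differ — but they illustrate the key property: the output is a pure function of (seed, epoch, counter), with no hidden state:

```c
#include <stdint.h>

/* Stateless counter-based PRNG: a pure mix of (seed, epoch, counter).
 * The finaliser is SplitMix64's; the real ct_prng may use another mix. */
uint64_t prng_sketch(uint64_t seed, uint32_t epoch, uint64_t counter)
{
    uint64_t x = seed ^ ((uint64_t)epoch << 32)
                      ^ (counter * 0x9E3779B97F4A7C15ULL);
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

/* Uniform draw in [0, bound), bound >= 1, by rejecting the biased tail.
 * Retries fold a rejection counter into the op counter's top bits --
 * a sketch-level convention, not the library's. */
uint32_t prng_uniform_sketch(uint64_t seed, uint32_t epoch,
                             uint64_t counter, uint32_t bound)
{
    uint64_t limit = UINT64_MAX - (UINT64_MAX % bound); /* multiple of bound */
    uint64_t r, retry = 0;
    do {
        r = prng_sketch(seed, epoch, counter ^ (retry++ << 48));
    } while (r >= limit);
    return (uint32_t)(r % bound);
}
```

Because the finaliser is a bijection, distinct counters are guaranteed distinct outputs for a fixed seed and epoch — so `OP_FLIP` and `OP_CROP_X` draws can never alias each other.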

Merkle Provenance

Every epoch produces a cryptographic commitment:

h_0 = SHA256(0x03 || H_dataset || H_config || seed)
h_e = SHA256(0x04 || h_{e-1} || H_epoch || e)

You can prove exactly what data was used for training. Any tampering breaks the chain.
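The chaining structure can be sketched as follows. FNV-1a is used here as a stand-in for SHA-256 so the sketch stays self-contained, and byte order is left as the platform's — the real chain uses 32-byte SHA-256 digests over a fixed serialisation. Note the domain-separation prefixes 0x03 and 0x04 from the formulas above:

```c
#include <stdint.h>
#include <string.h>

/* FNV-1a 64-bit: a self-contained stand-in for SHA-256 in this sketch. */
static uint64_t h64(const uint8_t *data, size_t len)
{
    uint64_t h = 0xCBF29CE484222325ULL;
    for (size_t i = 0; i < len; i++) { h ^= data[i]; h *= 0x100000001B3ULL; }
    return h;
}

/* h_0 = H(0x03 || H_dataset || H_config || seed) */
uint64_t chain_init(uint64_t h_dataset, uint64_t h_config, uint64_t seed)
{
    uint8_t buf[1 + 3 * 8] = { 0x03 };
    memcpy(buf + 1,  &h_dataset, 8);
    memcpy(buf + 9,  &h_config,  8);
    memcpy(buf + 17, &seed,      8);
    return h64(buf, sizeof buf);
}

/* h_e = H(0x04 || h_{e-1} || H_epoch || e) */
uint64_t chain_step(uint64_t h_prev, uint64_t h_epoch, uint32_t e)
{
    uint8_t buf[1 + 8 + 8 + 4] = { 0x04 };
    memcpy(buf + 1,  &h_prev,  8);
    memcpy(buf + 9,  &h_epoch, 8);
    memcpy(buf + 17, &e,       4);
    return h64(buf, sizeof buf);
}
```

Each epoch commitment folds in the previous one, so altering any epoch's data hash (or its index) changes every commitment after it.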

142 tests · 8/8 test suites · O(1) shuffle time · RNE rounding

What’s Implemented

All core modules complete — 142 tests passing across 8 test suites:

Module         | Tests | Coverage
---------------|-------|----------------------------------
DVM Primitives | 38    | CT-MATH-001 §3 test vectors
PRNG           | 13    | Determinism, distribution quality
Shuffle        | 19    | Bijection verification
Normalise      | 13    | Correctness, overflow handling
Augment        | 10    | Flip, crop, noise
Batch          | 12    | Construction, verification
Merkle         | 20    | Hashing, provenance chain
Bit-Identity   | 17    | Cross-platform verification

Usage Example

#include "ct_types.h"
#include "loader.h"
#include "normalize.h"
#include "shuffle.h"
#include "batch.h"
#include "merkle.h"

// Pre-allocated buffers (means, inv_stds, batch_samples, batch_hashes,
// seed, epoch, batch_index, dataset_hash and config_hash are assumed
// declared elsewhere)
ct_sample_t dataset_samples[60000];
ct_dataset_t dataset = {
    .samples = dataset_samples,
    .num_samples = 60000
};

ct_fault_flags_t faults = {0};

// Load data (deterministic decimal parsing)
ct_load_csv("mnist.csv", &dataset, &faults);

// Setup normalisation
ct_normalize_ctx_t norm_ctx;
ct_normalize_init(&norm_ctx, means, inv_stds, 784);

// Create batch via deterministic shuffle
ct_batch_t batch;
ct_batch_init(&batch, batch_samples, batch_hashes, 32);
ct_batch_fill(&batch, &dataset, batch_index, epoch, seed);

// Verify integrity
int valid = ct_batch_verify(&batch);

// Initialize provenance chain
ct_provenance_t prov;
ct_provenance_init(&prov, dataset_hash, config_hash, seed);

The Complete Pipeline

certifiable-data completes the deterministic ML pipeline:

certifiable-data → certifiable-training → certifiable-inference
     ↓                    ↓                      ↓
  Load data         Train model            Deploy model
  Normalise         Merkle chain           Bit-perfect
  Shuffle           Audit trail            inference
  Batch

Every step is deterministic. Every step is auditable. The same seed produces the same model produces the same predictions, forever.

Why It Matters

Reproducibility Crisis

The ML reproducibility crisis is well-documented. Papers can’t be replicated. Models can’t be reconstructed. Part of the problem is non-deterministic data pipelines — you can’t reproduce training if you can’t reproduce the exact data order.

Certification Requirements

IEC 62304 Class C (medical devices) requires traceable software. DO-178C Level A (aerospace) requires complete requirements traceability. “We shuffled the data randomly” satisfies neither.

With certifiable-data, you can prove:

  • Exactly what data was used
  • In exactly what order
  • With exactly what transformations
  • And verify it cryptographically

Debugging Training

When training fails or produces unexpected results, you need to understand what happened. With deterministic data loading, you can replay exact batches and inspect exact transformations. No “it worked differently yesterday” mysteries.

Getting Started

git clone https://github.com/williamofai/certifiable-data
cd certifiable-data
mkdir build && cd build
cmake ..
make
make test

Expected output:

100% tests passed, 0 tests failed out of 8
Total Test time (real) = 0.04 sec

Documentation

  • CT-MATH-001.md — Mathematical foundations (normalisation, Feistel, Merkle)
  • CT-STRUCT-001.md — Data structure specifications
  • docs/requirements/ — SRS documents with full traceability

Data loading as a pure function. Merkle-proven provenance. GPL-3.0 licensed.

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.

Questions or Contributions?

Open an issue on GitHub or get in touch directly.
