Standard ML data pipelines are a major source of non-determinism. Floating-point normalisation varies across platforms. Random shuffling produces different orders each run. Data augmentation introduces uncontrolled variation.
When you can’t reproduce your data pipeline, you can’t reproduce your training. When you can’t reproduce your training, you can’t certify your model.
certifiable-data makes data loading a pure function: B_t = Pipeline(D, seed, epoch, t). Given the same dataset, seed, and indices, you get the same batch — bit for bit, every time.
The Problem
Consider a typical PyTorch data loader:
```python
loader = DataLoader(dataset, shuffle=True, num_workers=4)
```

This single line introduces multiple sources of non-determinism:
- Shuffle order depends on PRNG state, which depends on when you called it
- Floating-point normalisation varies by platform
- Worker processes may return batches in different orders
- Augmentation (if any) introduces random transformations
For research, this doesn’t matter. For safety-critical systems, it’s disqualifying.
The Solution
Deterministic Normalisation
Standard normalisation uses floating-point: y = (x - mean) / std. The division introduces platform-dependent rounding.
certifiable-data uses fixed-point with precomputed inverse:
```c
// Q16.16 fixed-point normalisation
// y = (x - μ) * (1/σ)
// All operations use DVM primitives with RNE rounding
int32_t normalise(int32_t x, int32_t mean, int32_t inv_std,
                  ct_fault_flags_t *faults) {
    int64_t diff = (int64_t)x - (int64_t)mean;
    int64_t product = diff * (int64_t)inv_std;
    return dvm_round_shift_rne(product, 16, faults);
}
```

The result is deterministic because:
- All arithmetic is integer (no floating-point)
- Rounding uses Round-to-Nearest-Even (RNE), explicitly
- Overflow is handled by saturation with fault flags
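The rounding and saturation behaviour can be modelled in plain integer arithmetic. This is an illustrative Python sketch of the semantics described above — it is not the library's `dvm_round_shift_rne`, whose exact fault-flag handling is defined in CT-MATH-001:

```python
# Illustrative model of Q16.16 normalisation with round-to-nearest-even.
# Not the library's dvm_round_shift_rne; saturation stands in for fault flags.

Q = 16                                   # Q16.16: 16 fractional bits
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def round_shift_rne(value, shift):
    """Arithmetic right shift with round-to-nearest, ties-to-even."""
    floor_part = value >> shift                # floor division by 2**shift
    remainder = value - (floor_part << shift)  # always in [0, 2**shift)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and (floor_part & 1)):
        floor_part += 1
    return floor_part

def normalise_q16(x, mean, inv_std):
    """y = (x - mean) * inv_std in Q16.16, saturated to the int32 range."""
    y = round_shift_rne((x - mean) * inv_std, Q)
    return max(INT32_MIN, min(INT32_MAX, y))
```

Ties round to the even neighbour: 1.5 in Q16.16 (`3 << 15`) shifts to 2, while 0.5 (`1 << 15`) shifts to 0, so the result never depends on the platform's default rounding mode.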
Feistel Shuffling
Standard shuffling (Fisher-Yates) requires sequential access and maintains internal state. Different execution orders produce different shuffles.
We use a Cycle-Walking Feistel network — a cryptographic permutation that maps any index to its shuffled position in O(1) expected time:
```c
uint32_t permute_index(uint32_t index, uint32_t N,
                       uint64_t seed, uint32_t epoch) {
    // Feistel network with cycle-walking
    // π: [0, N-1] → [0, N-1] (bijection)
    // Same (seed, epoch, index) → same output, always
}
```

Test vectors from CT-MATH-001 §7.2:

```
N=100,   seed=0x123456789ABCDEF0, index=0  → 26
N=100,   seed=0x123456789ABCDEF0, index=99 → 41
N=60000, seed=0xFEDCBA9876543210, index=0  → 26382
```

The permutation is a true bijection — every input maps to exactly one output, and every output comes from exactly one input.
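The structure of such a permutation can be sketched as follows. The round function here is a placeholder, so its outputs will not match the CT-MATH-001 vectors; what it illustrates is the cycle-walking shape: permute inside a power-of-two domain, and re-permute until the result lands back in [0, N). A Feistel network is a bijection for any round function, so the sketch is a true permutation even with a toy mixer:

```python
# Illustrative cycle-walking Feistel permutation. The round function is a
# placeholder (outputs will NOT match CT-MATH-001 §7.2); the structure --
# balanced Feistel plus cycle-walking -- is what the text describes.

MASK64 = (1 << 64) - 1

def _round_fn(half, seed, epoch, r):
    """Placeholder keyed round function mixing (half, seed, epoch, round)."""
    x = (half ^ seed ^ (epoch * 0x9E3779B97F4A7C15) ^ r) & MASK64
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    return (x ^ (x >> 31)) & MASK64

def permute_index(index, n, seed, epoch, rounds=4):
    """Bijection on [0, n): same (seed, epoch, index) -> same output, always."""
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            left, right = right, left ^ (_round_fn(right, seed, epoch, r) & mask)
        x = (left << half_bits) | right
        if x < n:  # cycle-walk: repeat until we land back inside [0, n)
            return x
```

Because the permutation is stateless, any worker can compute any shuffled index independently — there is no shared Fisher–Yates state to race on.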
Deterministic Augmentation
Data augmentation typically uses random number generators. We use counter-based PRNG with explicit operation IDs:
```c
// Horizontal flip (50% probability)
uint64_t rng = ct_prng(seed, epoch, sample_idx << 16 | OP_FLIP);
bool flip = (rng & 1);

// Random crop
uint32_t crop_x = ct_prng_uniform(seed, epoch,
                                  sample_idx << 16 | OP_CROP_X,
                                  max_x + 1);
```

Same (seed, epoch, sample_idx) produces the same augmentation. No hidden state.
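A counter-based PRNG of this kind can be sketched with a splitmix64-style finalizer. The mixing constants, the `OP_*` IDs, and the `_sketch` names below are illustrative assumptions, not the library's actual `ct_prng`:

```python
# Illustrative counter-based PRNG: a pure function of (seed, epoch, op_id)
# with no hidden state. Splitmix64-style mixing; the OP_* constants and the
# exact mixer are assumptions, not the library's ct_prng.

MASK64 = (1 << 64) - 1
OP_FLIP, OP_CROP_X = 0x01, 0x02   # hypothetical operation IDs

def ct_prng_sketch(seed, epoch, op_id):
    x = (seed ^ (epoch * 0x9E3779B97F4A7C15)
              ^ (op_id * 0xD1B54A32D192ED03)) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def ct_prng_uniform_sketch(seed, epoch, op_id, bound):
    """Map to [0, bound); plain modulo here -- a real library may debias."""
    return ct_prng_sketch(seed, epoch, op_id) % bound

# Same inputs, same decision -- augmentation is replayable per sample:
sample_idx = 7
flip = ct_prng_sketch(0xDEADBEEF, 3, (sample_idx << 16) | OP_FLIP) & 1
```

Encoding the operation ID into the counter is what keeps independent decisions (flip vs. crop) for the same sample statistically decoupled while staying fully deterministic.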
Merkle Provenance
Every epoch produces a cryptographic commitment:
```
h_0 = SHA256(0x03 || H_dataset || H_config || seed)
h_e = SHA256(0x04 || h_{e-1} || H_epoch || e)
```

You can prove exactly what data was used for training. Any tampering breaks the chain.
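The two formulas translate directly to code. In this sketch, `H_dataset`, `H_config`, and `H_epoch` are assumed to be 32-byte SHA-256 digests, and the 8-byte seed and 4-byte epoch encodings are assumptions rather than the library's wire format:

```python
import hashlib

# Sketch of the provenance chain. Byte widths for seed (8) and epoch (4)
# are assumptions; the 0x03/0x04 domain-separation tags follow the text.

def chain_init(h_dataset, h_config, seed):
    # h_0 = SHA256(0x03 || H_dataset || H_config || seed)
    return hashlib.sha256(b"\x03" + h_dataset + h_config
                          + seed.to_bytes(8, "big")).digest()

def chain_step(h_prev, h_epoch, epoch):
    # h_e = SHA256(0x04 || h_{e-1} || H_epoch || e)
    return hashlib.sha256(b"\x04" + h_prev + h_epoch
                          + epoch.to_bytes(4, "big")).digest()
```

An auditor recomputes the chain from the dataset and config hashes and compares the final digest against the recorded one; a mismatch at epoch e pinpoints where the data diverged.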
What’s Implemented
All core modules complete — 142 tests passing across 8 test suites:
| Module | Tests | Coverage |
|---|---|---|
| DVM Primitives | 38 | CT-MATH-001 §3 test vectors |
| PRNG | 13 | Determinism, distribution quality |
| Shuffle | 19 | Bijection verification |
| Normalise | 13 | Correctness, overflow handling |
| Augment | 10 | Flip, crop, noise |
| Batch | 12 | Construction, verification |
| Merkle | 20 | Hashing, provenance chain |
| Bit-Identity | 17 | Cross-platform verification |
Usage Example
```c
#include "ct_types.h"
#include "loader.h"
#include "normalize.h"
#include "shuffle.h"
#include "batch.h"
#include "merkle.h"

// Pre-allocated buffers
ct_sample_t dataset_samples[60000];
ct_dataset_t dataset = {
    .samples = dataset_samples,
    .num_samples = 60000
};
ct_fault_flags_t faults = {0};

// Load data (deterministic decimal parsing)
ct_load_csv("mnist.csv", &dataset, &faults);

// Setup normalisation
ct_normalize_ctx_t norm_ctx;
ct_normalize_init(&norm_ctx, means, inv_stds, 784);

// Create batch via deterministic shuffle
ct_batch_t batch;
ct_batch_init(&batch, batch_samples, batch_hashes, 32);
ct_batch_fill(&batch, &dataset, batch_index, epoch, seed);

// Verify integrity
int valid = ct_batch_verify(&batch);

// Initialize provenance chain
ct_provenance_t prov;
ct_provenance_init(&prov, dataset_hash, config_hash, seed);
```

The Complete Pipeline
certifiable-data completes the deterministic ML pipeline:
```
certifiable-data → certifiable-training → certifiable-inference
       ↓                    ↓                      ↓
   Load data           Train model           Deploy model
   Normalise           Merkle chain          Bit-perfect
   Shuffle             Audit trail           inference
   Batch
```

Every step is deterministic. Every step is auditable. The same seed produces the same model produces the same predictions, forever.
Why It Matters
Reproducibility Crisis
The ML reproducibility crisis is well-documented. Papers can’t be replicated. Models can’t be reconstructed. Part of the problem is non-deterministic data pipelines — you can’t reproduce training if you can’t reproduce the exact data order.
Certification Requirements
IEC 62304 Class C (medical devices) requires traceable software. DO-178C Level A (aerospace) requires complete requirements traceability. “We shuffled the data randomly” satisfies neither.
With certifiable-data, you can prove:
- Exactly what data was used
- In exactly what order
- With exactly what transformations
- And verify it cryptographically
Debugging Training
When training fails or produces unexpected results, you need to understand what happened. With deterministic data loading, you can replay exact batches and inspect exact transformations. No “it worked differently yesterday” mysteries.
Getting Started
```
git clone https://github.com/williamofai/certifiable-data
cd certifiable-data
mkdir build && cd build
cmake ..
make
make test
```

Expected output:

```
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 0.04 sec
```

Documentation
- CT-MATH-001.md — Mathematical foundations (normalisation, Feistel, Merkle)
- CT-STRUCT-001.md — Data structure specifications
- docs/requirements/ — SRS documents with full traceability
Data loading as a pure function. Merkle-proven provenance. GPL-3.0 licensed.