You train a model on Monday. Validation accuracy: 94.2%. Satisfied, you go home.
Tuesday morning, a colleague re-runs the same training script, same data, same hyperparameters. Validation accuracy: 93.8%.
“That’s within normal variance,” everyone agrees. You deploy Monday’s model.
Six months later, during an audit, someone asks: “Can you reproduce this model?” You run the original script. Validation accuracy: 94.1%. Close, but the weights are different. The predictions on edge cases are different. The model you’re auditing isn’t the model you deployed.
This isn’t a story about sloppy engineering. This is the reality of machine learning in 2026. And for safety-critical applications—autonomous vehicles, medical devices, aerospace systems—it’s becoming a serious problem.
The Myth of Random Seeds
The standard advice is simple: “Set your random seed for reproducibility.”
import random

import numpy as np
import torch

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

This creates a comforting illusion. The same seed should produce the same sequence of “random” numbers, which should produce the same weight initialisation, the same batch ordering, the same dropout masks—and therefore the same trained model.
Except it doesn’t.
Setting the seed is necessary but nowhere near sufficient. There are at least seven sources of non-determinism that seeds don’t control, and most ML practitioners encounter them without realising it.
Source 1: Floating-Point Ordering
Consider a simple sum: a + b + c + d
In exact arithmetic, the order doesn’t matter. In floating-point arithmetic, it absolutely does:
(a + b) + (c + d) ≠ ((a + c) + b) + d

The differences are tiny—perhaps the 15th decimal place. But neural networks perform billions of these operations during training. Tiny errors accumulate. By the end of training, weights can differ meaningfully.
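A few lines of Python make the effect concrete (the specific values are chosen purely for illustration):

import random

# Same three values, two groupings, two different answers.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x)        # 0.6000000000000001
print(y)        # 0.6
print(x == y)   # False

# The same effect at scale: summing identical values in a different order
# gives a slightly different total.
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
total_forward = sum(values)
random.shuffle(values)
total_shuffled = sum(values)
print(total_forward == total_shuffled)   # almost always False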
Modern GPUs make this worse. To maximise throughput, they process operations in parallel and accumulate results in non-deterministic order. The same operation on the same GPU can produce different results on different runs—not because of randomness, but because of scheduling variations at the microsecond level.
Source 2: GPU Non-Determinism
NVIDIA’s cuDNN library—the backbone of deep learning on GPUs—uses algorithms that are non-deterministic by default. Operations like convolution and pooling have multiple valid implementation strategies. cuDNN selects between them based on runtime profiling, cache state, and GPU occupancy.
You can force deterministic mode:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

But this comes with a 10-30% performance penalty. Most teams leave it disabled for training speed, accepting non-reproducibility as the cost.
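If you do need strict determinism, newer PyTorch releases add a stricter switch on top of the cuDNN flags. A minimal sketch, assuming PyTorch 1.8+ and CUDA 10.2+:

import os
import torch

# cuBLAS needs a fixed workspace size for deterministic behaviour on CUDA >= 10.2.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Raises an error if the model uses an operation with no deterministic
# implementation, rather than silently falling back to a non-deterministic one.
torch.use_deterministic_algorithms(True)

The stricter mode fails loudly, which is usually preferable to silently training a model you cannot reproduce.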
Source 3: Data Loading Parallelism
Modern training pipelines load data in parallel:
DataLoader(dataset, num_workers=8, shuffle=True)

With multiple workers, the order in which batches arrive depends on operating system scheduling, disk I/O timing, and CPU load. The shuffle happens correctly, but the precise sequence of batches varies between runs.
“But I set shuffle=True with a seed!” Yes, and the shuffle itself is reproducible. But which worker finishes first? That’s determined by factors outside your control.
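Part of this can be pinned down by seeding the shuffle generator and each worker explicitly, following the pattern in PyTorch's reproducibility notes. A sketch, with a stand-in dataset for illustration:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset purely so the example runs end to end.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

def seed_worker(worker_id):
    # Derive each worker's NumPy and Python seeds from the PyTorch base seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,
    num_workers=8,
    shuffle=True,
    worker_init_fn=seed_worker,
    generator=g,
)

This pins the shuffle order and the per-worker NumPy and Python seeds; timing effects elsewhere in the pipeline remain outside its reach.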
Source 4: Batch Normalisation Statistics
Batch normalisation layers compute running statistics during training:
running_mean = momentum * running_mean + (1 - momentum) * batch_mean

These statistics depend on the exact sequence of batches seen. If batch ordering varies (Source 3), running statistics vary. At inference time, the model behaves differently because its normalisation parameters are different.
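A toy sketch of the effect: the same batches, fed in a different order, leave different statistics behind.

import torch
import torch.nn as nn

torch.manual_seed(0)
batches = [torch.randn(8, 4) for _ in range(10)]

# Two identical BatchNorm layers fed the same batches in different orders.
bn_a, bn_b = nn.BatchNorm1d(4), nn.BatchNorm1d(4)
for batch in batches:
    bn_a(batch)              # modules default to training mode, so stats update
for batch in reversed(batches):
    bn_b(batch)

# The exponential moving average weights recent batches more heavily,
# so the final running statistics differ.
print(torch.allclose(bn_a.running_mean, bn_b.running_mean))  # typically False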
Source 5: Dropout and Stochastic Layers
Dropout masks are generated from the random state. If anything perturbs that state—a different batch order, a parallel data loading race, even an unrelated library call that happens to draw from the same generator—the dropout masks change.
The same applies to any stochastic layer: variational layers, noise injection, stochastic depth. Each is a potential divergence point.
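The fragility is easy to demonstrate. In the sketch below, a single unrelated draw from the global generator shifts every subsequent dropout mask:

import torch

torch.manual_seed(42)
mask_a = torch.nn.functional.dropout(torch.ones(8), p=0.5)

torch.manual_seed(42)
_ = torch.rand(1)            # one unrelated draw from the same global RNG
mask_b = torch.nn.functional.dropout(torch.ones(8), p=0.5)

print(torch.equal(mask_a, mask_b))  # False: the masks have diverged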
Source 6: Hardware Differences
Train on an A100 GPU. Deploy on a T4. Even with deterministic flags enabled, different GPU architectures have different floating-point implementations. Results will differ.
Train on GPU. Run inference on CPU. The differences can be even larger—different instruction sets, different SIMD widths, different handling of denormals.
“But the differences are tiny!” Yes. And “tiny” can mean a different classification on edge cases. In medical imaging, an edge case might be a tumour.
Source 7: Library Versions
PyTorch 2.0 changed the default random number generator algorithm. Models trained with PyTorch 1.x cannot be exactly reproduced with PyTorch 2.x—even with identical seeds. The “same” seed produces different numbers.
cuDNN updates routinely change which algorithms are selected. CUDA versions affect floating-point behaviour. Even NumPy has changed its random number generation across versions.
The model you trained 18 months ago, with pinned library versions? Good luck recreating that environment exactly.
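One inexpensive habit that helps later audits is recording the exact library and hardware fingerprint alongside every trained artefact. A minimal sketch (the exact fields to record are a judgement call):

import json
import platform

import numpy as np
import torch

fingerprint = {
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "cudnn": torch.backends.cudnn.version(),
    "numpy": np.__version__,
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}
print(json.dumps(fingerprint, indent=2))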
Why This Matters Now
For research and commercial ML, non-reproducibility is an inconvenience. You run experiments multiple times and report averages. You accept some variance in production model quality.
For safety-critical applications, non-reproducibility is a fundamental obstacle.
Certification Requirements
DO-178C (aerospace software) requires traceability from requirements to code to test evidence. If the same training process produces different models, which model are you certifying? Can you prove that the deployed model is the one you tested?
IEC 62304 (medical device software) requires documented software development processes. If training is non-deterministic, how do you document what the process produced? How do you respond when a regulator asks to reproduce a result?
ISO 26262 (automotive functional safety) requires systematic approaches to avoid unreasonable risk. How do you argue that your ML component is safe if you cannot reproduce its development?
The EU AI Act
The EU AI Act, coming into force progressively through 2025-2027, mandates technical documentation demonstrating how AI systems were developed. High-risk systems—including those in medical devices, vehicles, and critical infrastructure—require particularly rigorous documentation.
Article 11 requires documentation of “the design specifications of the system, in particular the general logic of the AI system and of the algorithms.”
Article 12 requires “automatic recording of events” enabling tracing of the system’s operation.
Non-deterministic training makes both requirements difficult to satisfy. How do you document the logic of a model whose weights depend on GPU scheduling? How do you trace operation when you cannot reproduce the artefact?
Audit and Liability
When an autonomous vehicle causes an accident, lawyers will ask: “Show us the exact model that made this decision, and prove how it was trained.”
When a diagnostic AI misses a cancer, regulators will ask: “Reproduce the training process and show us that the deployed model matches your validation results.”
If you cannot reproduce training exactly, you cannot answer these questions definitively. You’re left arguing that the model was “probably similar” to what you tested. That’s not a strong legal position.
The Gap
The machine learning community has made extraordinary progress on model architectures, training techniques, and deployment infrastructure. Reproducibility has received comparatively little attention.
This creates a gap: ML capabilities are racing ahead, while ML auditability remains stuck. The models get more powerful; the ability to certify them does not keep pace.
For non-safety applications, this gap is acceptable. For safety-critical applications, it’s becoming a blocker. Companies are building increasingly capable AI systems that they cannot certify for deployment in regulated environments.
The question is not whether deterministic ML is desirable. It obviously is. The question is whether it’s achievable.
The Path Forward
Every source of non-determinism listed above has a solution. The solutions are not trivial, but they exist.
Floating-point ordering can be controlled through careful algorithm design—fixed accumulation orders, Kahan summation, or integer-based computation for critical paths.
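As a concrete illustration of compensated accumulation, a minimal Kahan summation sketch: it carries the rounding error forward explicitly, so the result is far less sensitive to ordering.

def kahan_sum(values):
    """Compensated summation: track the rounding error lost at each step."""
    total = 0.0
    compensation = 0.0                   # running estimate of lost low-order bits
    for value in values:
        y = value - compensation
        t = total + y                    # low-order bits of y may be lost here...
        compensation = (t - total) - y   # ...but are recovered and carried forward
        total = t
    return total

print(kahan_sum([0.1] * 10))    # 1.0: the compensation recovers the lost bits
print(sum([0.1] * 10))          # 0.9999999999999999 with naive accumulation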
GPU non-determinism can be eliminated by avoiding non-deterministic operations or implementing deterministic alternatives.
Data loading parallelism can be made deterministic through careful synchronisation and pre-computed batch orderings.
Batch normalisation can use deterministic running statistics or be replaced with alternatives like layer normalisation.
Dropout can use pre-computed masks derived from cryptographic hash functions rather than stateful random number generators.
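A toy sketch of that idea, illustrating the general approach rather than any particular framework's implementation: each mask is derived purely from its coordinates, here a seed, a layer name and a training step, so no shared mutable RNG state is involved. The function name and the byte-to-uniform mapping are hypothetical choices for illustration.

import hashlib
import numpy as np

def dropout_mask(seed: int, layer: str, step: int, size: int, p: float) -> np.ndarray:
    """Hypothetical deterministic dropout mask derived from SHA-256 of (seed, layer, step)."""
    uniforms = np.empty(size, dtype=np.float64)
    counter = 0
    filled = 0
    while filled < size:
        msg = f"{seed}:{layer}:{step}:{counter}".encode()
        block = hashlib.sha256(msg).digest()          # 32 pseudorandom bytes
        for b in block:
            if filled == size:
                break
            uniforms[filled] = b / 255.0              # map each byte into [0, 1]
            filled += 1
        counter += 1
    keep = (uniforms >= p).astype(np.float64)
    return keep / (1.0 - p)                           # inverted-dropout scaling

mask = dropout_mask(seed=42, layer="fc1", step=1000, size=8, p=0.5)

Because the mask is a pure function of its inputs, it can be regenerated exactly during replay or audit, independent of anything else that happened during training.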
Hardware differences can be minimised through fixed-point arithmetic or careful floating-point discipline.
Library dependencies can be controlled through rigorous environment management and, ultimately, through purpose-built frameworks that don’t share codepaths with non-deterministic implementations.
None of this is easy. All of it is possible. The result is ML training where:
- Same data + same seed = identical model
- Every weight is traceable to its initialisation and training history
- Training can be replayed for audit or debugging
- Certification evidence is reproducible
This is what deterministic machine learning looks like. Not “mostly reproducible.” Not “reproducible within tolerance.” Bit-for-bit identical, on any hardware, years later.
Implications
If deterministic ML is achievable, several implications follow:
Certification becomes tractable. You can train a model, validate it, and certify the specific artefact you validated. Reproduce it for audit. Demonstrate compliance with documentation requirements.
Debugging becomes possible. When a model misbehaves, you can replay training to the exact step where problematic behaviour emerged. Step through the learning process. Identify causal factors.
Liability becomes manageable. You can prove what model was deployed. Reproduce its development. Answer regulatory questions definitively rather than probabilistically.
Development becomes faster. No more “run it five times to see if the result is real.” Train once, evaluate once, trust the result.
The safety-critical AI market is constrained not by model capability but by certification capability. Deterministic ML removes that constraint.
The Question
This article has described a problem. Solutions exist. The technology to build deterministic ML frameworks is available today.
The question for anyone building AI for safety-critical applications: how much longer can you afford to deploy systems you cannot reproduce?
SpeyTech develops deterministic computing infrastructure for safety-critical systems. Our deterministic ML and RL frameworks achieve bit-perfect reproducibility across hardware platforms. Patents GB2521625.0 and GB2522369.4. For technical discussions, contact william@fstopify.com.