Stochastic rounding is a technique for training neural networks at low precision. Instead of always rounding 2.7 to 3 and 2.3 to 2, you round probabilistically based on the fractional part. 2.7 rounds to 3 with 70% probability, to 2 with 30% probability.
The benefit: unbiased gradients. Over many operations, the expected value equals the true value. Models train better at low precision.
The problem: randomness. Different random seeds, different training runs. Non-reproducible results.
But here’s the insight: the “random” numbers don’t need to be random. They just need to be unpredictable from the perspective of the computation. A deterministic PRNG, seeded from the operation context, provides the same regularisation effect — with full reproducibility.
The Mechanism
Traditional stochastic rounding:
```python
import random

def stochastic_round(x):
    floor = int(x)          # truncates toward zero; fine for x >= 0
    frac = x - floor
    if random.random() < frac:
        return floor + 1
    return floor
```

For x = 2.7, we generate a random number in [0, 1). If it's less than 0.7, we round up; otherwise we round down. On average, we round up 70% of the time.
Deterministic stochastic rounding:
```c
int32_t ct_stochastic_round(int64_t x, uint32_t shift,
                            ct_prng_t *prng,
                            ct_fault_flags_t *faults)
{
    int64_t mask = (1LL << shift) - 1;
    int64_t frac = x & mask;        /* non-negative remainder */
    int64_t truncated = x >> shift; /* arithmetic shift: floor(x / 2^shift) */

    /* Generate threshold from PRNG */
    uint32_t rand = ct_prng_next(prng);
    uint32_t threshold = rand >> (32 - shift);

    /* Round up with probability frac / 2^shift; because the shift
       floors and frac is the non-negative remainder, the increment
       is +1 for negative values too */
    if ((uint64_t)frac > threshold) {
        truncated += 1;
    }
    return dvm_clamp32(truncated, faults);
}
```

The key difference: ct_prng_next(prng) is deterministic. Given the same PRNG state, we get the same "random" number. The rounding decision is reproducible.
The PRNG Design
The counter-based PRNG in certifiable-training is a pure function:
```
output = f(seed, operation_id, step)
```

Where:
- seed: Fixed for the entire training run
- operation_id: Unique per operation (layer × tensor × element)
- step: Monotonically increasing counter
```c
typedef struct {
    uint64_t seed;
    uint64_t op_id;
    uint64_t step;
} ct_prng_t;

uint32_t ct_prng_next(ct_prng_t *prng)
{
    /* Combine state components */
    uint64_t x = prng->seed;
    x ^= prng->op_id;
    x += prng->step;

    /* Mix with multiply-xorshift (the MurmurHash3 finaliser constants) */
    x ^= x >> 33;
    x *= 0xFF51AFD7ED558CCDULL;
    x ^= x >> 33;
    x *= 0xC4CEB9FE1A85EC53ULL;
    x ^= x >> 33;

    prng->step++;
    return (uint32_t)(x >> 32);
}
```

The operation_id ensures different elements get different sequences. The step ensures repeated operations on the same element get different values. The seed allows reproducibility: same seed, same training run.
Why This Works
The magic of stochastic rounding isn’t randomness per se — it’s that the rounding errors are uncorrelated with the values being rounded.
Consider gradient accumulation. With truncation (round toward zero), small gradients are systematically lost. A gradient of 0.0001 truncates to 0, every time. Information disappears.
With stochastic rounding, that 0.0001 gradient rounds up to one quantum with probability proportional to its size, preserving the signal in expectation. Not on every step, but often enough that the expected value is correct.
The PRNG provides this uncorrelated behaviour. The sequence 0x24F74A49, 0xA96E3F40, 0xC1C8ECFB… has no relation to the gradients being rounded. Statistically, it behaves like true randomness.
But unlike true randomness, we can replay it. Given the same (seed, op_id, step), we get the same sequence. Training is reproducible.
The Tradeoff
True stochastic rounding provides strong theoretical guarantees about unbiasedness. Our deterministic version provides empirically equivalent behaviour with reproducibility.
The difference: with true randomness, you can prove statistical properties from first principles. With a PRNG, you’re relying on the PRNG quality to not introduce systematic bias.
Modern PRNGs pass all standard statistical tests. For practical training, they’re indistinguishable from true randomness. But for formal analysis, you’d need to verify PRNG properties explicitly.
Implementation Details
Operation ID Construction
Each weight update needs a unique operation_id:
```c
uint64_t ct_prng_make_op_id(uint32_t layer_id,
                            uint32_t tensor_id,
                            uint32_t element_idx)
{
    /* Pack into disjoint fields to avoid collisions:
       16-bit layer, 16-bit tensor, 32-bit element */
    return ((uint64_t)(layer_id & 0xFFFF) << 48) |
           ((uint64_t)(tensor_id & 0xFFFF) << 32) |
           (uint64_t)element_idx;
}
```

Layer 3, tensor 1, element 42 → unique op_id. Layer 3, tensor 1, element 43 → different op_id. Different sequences, uncorrelated rounding decisions.
Step Synchronisation
The step counter must advance consistently across platforms:
```c
/* Wrong: step depends on the actual batch size */
for (int i = 0; i < batch_size; i++) {
    ct_prng_init(&prng, seed, op_id);
    prng.step = global_step * batch_size + i; /* Varies with batch size */
    ...
}

/* Right: step derived from a fixed maximum */
for (int i = 0; i < batch_size; i++) {
    uint64_t step = global_step * max_batch_size + i;
    ct_prng_init(&prng, seed, op_id);
    prng.step = step;
    ...
}
```

The max_batch_size constant ensures the same step values regardless of the actual batch size. No platform-dependent ordering.
Fixed-Point Integration
Stochastic rounding integrates with the Q16.16 arithmetic:
```c
/* Multiply two Q16.16 values with stochastic rounding */
int32_t mul_sr(int32_t a, int32_t b, ct_prng_t *prng,
               ct_fault_flags_t *faults)
{
    int64_t product = (int64_t)a * (int64_t)b;
    /* Product is Q32.32; round back to Q16.16 */
    return ct_stochastic_round(product, 16, prng, faults);
}
```

The 64-bit intermediate preserves precision. The stochastic round preserves the expected value while reducing to 32 bits.
When to Use Stochastic Rounding
Use it for:
- Gradient accumulation (small gradients matter)
- Weight updates (preserve small changes)
- Activation functions (reduce quantisation noise)
Don’t use it for:
- Loss calculation (deterministic RNE preferred)
- Inference (reproducibility more important than regularisation)
- Checkpointing (exact values needed for resumption)
The certifiable-* ecosystem uses stochastic rounding selectively — during training where regularisation helps, RNE elsewhere for strict determinism.
Test Vectors
For compliance, these test cases must pass:
```c
ct_prng_t prng;
ct_fault_flags_t faults = {0};

/* Seed 0, op_id 0: verify sequence */
ct_prng_init(&prng, 0, 0);
assert(ct_prng_next(&prng) == 0x24F74A49);
assert(ct_prng_next(&prng) == 0xA96E3F40);

/* Same seed, same op_id: same sequence */
ct_prng_t prng2;
ct_prng_init(&prng2, 0, 0);
assert(ct_prng_next(&prng2) == 0x24F74A49);

/* Different op_id: different sequence */
ct_prng_init(&prng, 0, 1);
assert(ct_prng_next(&prng) != 0x24F74A49);

/* Stochastic round determinism */
ct_prng_init(&prng, 12345, 500);
ct_prng_t prng_copy = prng;
int32_t r1 = ct_stochastic_round(0x18000LL, 16, &prng, &faults);
int32_t r2 = ct_stochastic_round(0x18000LL, 16, &prng_copy, &faults);
assert(r1 == r2);
```

The Broader Picture
Stochastic rounding is one example of a broader principle: “random” behaviours in ML often don’t need true randomness. They need unpredictability — which a good PRNG provides deterministically.
Dropout? PRNG-controlled. Data augmentation? PRNG-controlled. Initialisation? PRNG-controlled.
Seed everything from a single master seed, derive operation-specific sequences, and the entire training run becomes reproducible. Different seed → different run. Same seed → identical run.
For safety-critical systems, this is more than convenient. It’s a requirement. When you claim “this model was trained with these hyperparameters,” you need to be able to prove it. Reproducibility is the foundation of that proof.
Conclusion
Stochastic rounding demonstrates that determinism and “randomness” aren’t opposites. The regularisation benefits of stochastic methods come from uncorrelated errors, not from true randomness.
A well-designed PRNG provides uncorrelated sequences. Seeded from operation context, it provides reproducibility. The training benefits of stochastic rounding, without the reproducibility costs.
The certifiable-* ecosystem uses this approach throughout. Random-looking behaviour, fully deterministic implementation. Train on Monday, train on Friday, get identical results.
As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context. For systems where reproducibility is mandatory, PRNG-controlled stochastic rounding provides the best of both worlds.
Explore the PRNG implementation in certifiable-training or see the test vectors in the CT-MATH-001 specification.