The certifiable-* ecosystem uses Q16.16 fixed-point arithmetic throughout. This isn’t because floating-point is inherently evil - it’s because the specific properties of fixed-point arithmetic align with the specific requirements of safety-critical systems.
But fixed-point has real costs. Limited range. Manual overflow handling. More development effort. For many applications, these costs outweigh the benefits.
This article provides an honest framework for deciding when fixed-point is worth it.
What Fixed-Point Actually Is
Fixed-point arithmetic represents fractional numbers using integers with an implicit scale factor. In Q16.16 format:
Raw integer value: 0x00018000 (98,304)
Interpreted value: 98,304 / 65,536 = 1.5
The “Q16.16” notation means 16 bits for the integer part and 16 bits for the fractional part. Stored in a 32-bit signed integer, a Q16.16 value ranges from approximately -32,768 to +32,767, with a precision of 1/65,536 ≈ 0.000015.
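The code snippets throughout this article use a q16_t type, along with Q16_ONE and Q16_FROM_FLOAT in later examples. A minimal sketch of the definitions they assume (the exact definitions in the certifiable-* libraries may differ):
#include <stdint.h>

typedef int32_t q16_t;                 // Q16.16 value: raw integer with an implicit /65,536 scale

#define Q16_ONE  ((q16_t)(1 << 16))    // 1.0 in Q16.16 (raw value 65,536)
#define Q16_FROM_FLOAT(x) ((q16_t)((x) * 65536.0f))  // illustrative conversion; truncates toward zero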
Basic Operations
Addition and subtraction work directly on the raw integers:
// Q16.16 addition: just integer addition
q16_t q16_add(q16_t a, q16_t b) {
return a + b; // Same scale factor, direct addition
}
// Example: 1.5 + 2.25 = 3.75
// 98,304 + 147,456 = 245,760
// 245,760 / 65,536 = 3.75 ✓
Multiplication requires a shift to maintain the scale:
// Q16.16 multiplication
q16_t q16_mul(q16_t a, q16_t b) {
int64_t product = (int64_t)a * (int64_t)b;
return (q16_t)(product >> 16); // Divide by scale factor
}
// Example: 1.5 × 2.0 = 3.0
// 98,304 × 131,072 = 12,884,901,888
// 12,884,901,888 >> 16 = 196,608
// 196,608 / 65,536 = 3.0 ✓
Division is the inverse:
// Q16.16 division
q16_t q16_div(q16_t a, q16_t b) {
int64_t scaled = (int64_t)a << 16; // Multiply by scale factor
return (q16_t)(scaled / b);
}
These operations are pure integer arithmetic. No floating-point unit required. No hardware-dependent rounding modes.
Why Floating-Point Is Non-Deterministic
IEEE 754 floating-point seems deterministic - the standard specifies exact behaviour. But the specification has gaps, and hardware implementations differ.
Rounding Mode Variability
The IEEE 754 standard defines five rounding modes:
- Round to nearest, ties to even (default)
- Round to nearest, ties away from zero
- Round toward zero
- Round toward positive infinity
- Round toward negative infinity
All mainstream hardware defaults to “round to nearest, ties to even” for binary formats, but the active mode is mutable process state rather than a property of your source code: a library call, a runtime, or a non-default build configuration can switch it, and embedded toolchains sometimes enable non-IEEE behaviour such as flush-to-zero handling of denormals. Same computation, different environment, different results.
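A small illustration of this environment dependence, using the standard <fenv.h> interface (a sketch: FE_UPWARD is an optional macro, and some compilers need a strict floating-point mode before they fully honour fesetround):
#include <fenv.h>
#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 3.0;   // volatile: discourage constant folding

    fesetround(FE_TONEAREST);
    double nearest = a / b;             // 1/3 rounded to nearest

    fesetround(FE_UPWARD);
    double upward = a / b;              // 1/3 rounded toward +infinity: one ulp larger

    printf("same result? %d\n", nearest == upward);  // prints 0 where the mode is honoured
    fesetround(FE_TONEAREST);           // restore the default
    return 0;
}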
Extended Precision
x87 floating-point (used on older x86 processors) uses 80-bit extended precision internally, then rounds to 64-bit for storage. This means:
double a = /* some value */;
double b = /* some value */;
double c = a * b; // May stay in an 80-bit register, or be rounded to 64-bit if spilled
// vs.
double temp = a * b; // Intermediate may be spilled to memory (and rounded) or kept in a register
double c = temp; // May or may not reload the rounded value from memory
The compiler’s register allocation decisions affect numerical results. Same source code, different binaries, different results.
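Whether a platform evaluates double expressions in a wider internal format is advertised by the standard FLT_EVAL_METHOD macro from <float.h>; a quick way to check what your build actually does:
#include <float.h>
#include <stdio.h>

int main(void) {
    // 0: evaluate in the declared type (SSE2 and most modern targets)
    // 1: evaluate float expressions as double
    // 2: evaluate internally in long double (classic x87 behaviour)
    // negative values: indeterminable
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}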
Fused Multiply-Add
Modern processors support FMA (fused multiply-add) instructions that compute a * b + c with a single rounding at the end, rather than rounding after the multiply and again after the add.
double fma_result = a * b + c; // Might use FMA (one rounding)
double separate = (a * b) + c; // Might not (two roundings)
Whether the compiler uses FMA depends on optimisation level, target architecture, and compiler version. Same expression, different binary, different result.
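The contrast can be made explicit with the C99 fma() function, which always applies a single rounding; whether the plain expression gets fused is the compiler's choice. The values below are chosen purely to make the difference visible:
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double a = 1.0 + 0x1p-27;   // 1 + 2^-27, exactly representable
    volatile double b = 1.0 - 0x1p-27;   // 1 - 2^-27, exactly representable
    volatile double c = -1.0;

    double fused    = fma(a, b, c);      // one rounding: yields -2^-54 exactly
    double separate = (a * b) + c;       // two roundings: a*b rounds to 1.0, so this is 0.0
                                         // (unless the compiler contracts it into an FMA anyway)

    printf("fused=%g separate=%g\n", fused, separate);
    return 0;
}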
Associativity Failures
Floating-point addition is not associative:
double a = 1e-16;
double b = 1.0;
double c = -1.0;
double x = (a + b) + c; // = 0.0 (1e-16 absorbed by 1.0)
double y = a + (b + c); // = 1e-16 (b + c = 0, then add a)
When compilers reorder operations for optimisation, results change. Parallel reduction (summing array elements across multiple threads) is particularly vulnerable - the order of operations affects the result.
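A small illustration of the reduction-order problem - the same four values summed left-to-right versus pairwise (the grouping a two-thread reduction might use) give different answers:
#include <stdio.h>

int main(void) {
    double data[4] = {1e16, 1.0, -1e16, 1.0};

    // Left-to-right: the first 1.0 is absorbed by 1e16, the last one survives
    double sequential = ((data[0] + data[1]) + data[2]) + data[3];   // 1.0

    // Pairwise, as a parallel reduction might group the work
    double pairwise = (data[0] + data[1]) + (data[2] + data[3]);     // 0.0

    printf("sequential=%g pairwise=%g\n", sequential, pairwise);
    return 0;
}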
The Practical Impact
For most applications, these differences are negligible - they affect the least significant bits. But for safety-critical systems:
- Different platforms produce different results
- The same platform may produce different results across compiler versions
- Test results on development machines don’t guarantee production behaviour
- Debugging is harder when results aren’t reproducible
This is why fixed-point matters.
Why Fixed-Point Is Deterministic
Fixed-point arithmetic uses only integer operations, which are fully deterministic across all platforms:
// Integer addition: always the same
int32_t a = 100;
int32_t b = 200;
int32_t c = a + b; // Always 300, everywhere, forever
Deterministic Rounding
The one place where fixed-point requires decisions is the shift in multiplication:
int64_t product = (int64_t)a * (int64_t)b;
return (q16_t)(product >> 16); // Truncation
A plain arithmetic right shift truncates toward negative infinity. This is deterministic, but it introduces a systematic downward bias. The solution is explicit round-to-nearest-even (RNE):
q16_t q16_mul_rne(q16_t a, q16_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    // Round to nearest, ties to even
    int64_t lsb    = (product >> 16) & 1;       // least significant bit that will be kept
    int64_t guard  = (product >> 15) & 1;       // most significant bit being discarded
    int64_t sticky = (product & 0x7FFF) != 0;   // any other discarded bit set
    int64_t round_up = guard & (sticky | lsb);  // above half: always round up; exactly half: only if the kept LSB is odd
    return (q16_t)((product >> 16) + round_up);
}
This is more code than truncation, but it’s pure integer arithmetic, fully deterministic, and produces the same result on every platform.
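To see the tie rule in action, consider a product whose discarded bits are exactly one half (raw values chosen for illustration):
// a = 0x0003 (3/65,536), b = 0x8000 (0.5)
// product            = 0x00018000 -> the discarded low 16 bits are exactly 0x8000 (one half)
// truncation (>> 16) = 1
// RNE: the candidates are 1 and 2; the tie goes to the even one, so the result is 2
q16_t r = q16_mul_rne(0x0003, 0x8000);   // r == 2, i.e. 2/65,536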
For details on why RNE matters, see Round-to-Nearest-Even: The Rounding Mode That Makes Determinism Possible.
No Special Values
Floating-point has special values: positive/negative infinity, positive/negative zero, NaN (Not a Number), denormalised numbers. Each creates edge cases:
double x = 0.0 / 0.0; // NaN
double y = 1.0 / 0.0; // Infinity
int z = (x == x); // 0 (false): NaN compares unequal, even to itself
Fixed-point has none of this. Overflow is a real hazard, but it is one you can detect and handle explicitly (see below); division by zero is undefined behaviour that a simple guard prevents. There’s no NaN to propagate silently through computations.
The Range-Precision Trade-off
Fixed-point’s determinism comes at a cost: limited range.
Q16.16 Limits
Maximum value: 32,767.999985 (0x7FFFFFFF)
Minimum value: -32,768.0 (0x80000000)
Precision: 0.000015 (1/65,536)
For neural network weights typically in the range [-2, 2], this is ample precision. For financial calculations involving millions of dollars, it’s insufficient.
Choosing the Format
Different applications need different trade-offs:
| Format | Integer Bits | Fraction Bits | Range | Precision |
|---|---|---|---|---|
| Q8.24 | 8 | 24 | ±127 | 0.00000006 |
| Q16.16 | 16 | 16 | ±32,767 | 0.000015 |
| Q24.8 | 24 | 8 | ±8,388,607 | 0.004 |
| Q31.1 | 31 | 1 | ±1,073,741,823 | 0.5 |
High precision, low range (Q8.24): Signal processing, audio, small neural networks with normalised weights.
Balanced (Q16.16): General-purpose, neural network inference, control systems.
High range, low precision (Q24.8): Financial calculations with large integers, counters, timestamps.
The format choice is a design decision that must be made upfront and documented.
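One way to see the format as a design parameter: every operation depends on the same scale constant, so a generic multiply differs between formats only in the shift amount (an illustrative helper, not part of any particular library):
// Generic Qm.n multiply for 32-bit formats. frac_bits is fixed at design time:
// 24 for Q8.24, 16 for Q16.16, 8 for Q24.8. Truncating variant for brevity.
static inline int32_t q_mul(int32_t a, int32_t b, int frac_bits) {
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> frac_bits);
}
In practice the shift amount is a compile-time constant rather than a runtime parameter, which keeps the chosen format explicit and easy to audit.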
Overflow Handling
Fixed-point overflow wraps silently on two’s-complement hardware (and in C, signed overflow is formally undefined behaviour, which is worse still):
q16_t max = 0x7FFFFFFF; // 32,767.999985
q16_t result = max + 1; // 0x80000000 = -32,768.0 (wrapped!)
This must be handled explicitly:
q16_t q16_add_saturate(q16_t a, q16_t b) {
int64_t sum = (int64_t)a + (int64_t)b;
if (sum > INT32_MAX) return INT32_MAX;
if (sum < INT32_MIN) return INT32_MIN;
return (q16_t)sum;
}
Saturation arithmetic clamps to the maximum/minimum rather than wrapping. This is often the right behaviour for control systems - better to hit a limit than wrap to the opposite extreme.
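The same pattern extends to multiplication; a sketch of a saturating Q16.16 multiply (truncating the fraction for brevity):
q16_t q16_mul_saturate(q16_t a, q16_t b) {
    int64_t product = ((int64_t)a * (int64_t)b) >> 16;   // scale back to Q16.16 in 64 bits
    if (product > INT32_MAX) return (q16_t)INT32_MAX;    // clamp at +32,767.999985
    if (product < INT32_MIN) return (q16_t)INT32_MIN;    // clamp at -32,768.0
    return (q16_t)product;
}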
The Decision Framework
Here’s when to choose fixed-point:
Use Fixed-Point When:
Determinism is required. Safety-critical systems, certified software, reproducible research, bit-exact testing across platforms.
Certification is involved. DO-178C, IEC 62304, ISO 26262 all favour analysable arithmetic. Fixed-point’s bounded, deterministic behaviour is easier to certify.
No FPU is available. Embedded systems, microcontrollers, and some safety-critical processors lack floating-point hardware.
Cross-platform reproducibility matters. If test results must match production exactly, floating-point is risky.
Values are bounded. If you know your values stay within ±32,767 (or whatever format you choose), fixed-point works well.
Use Floating-Point When:
Dynamic range is large. Values spanning 10⁻³⁰ to 10³⁰ require floating-point’s exponential representation.
Precision varies by magnitude. Floating-point maintains relative precision across scales; fixed-point has constant absolute precision.
Performance on GPU matters. GPU floating-point is highly optimised; fixed-point often requires workarounds.
The application isn’t safety-critical. For research, prototyping, and non-critical applications, floating-point’s convenience outweighs its non-determinism.
Library support is needed. Scientific computing libraries assume floating-point. Translating to fixed-point may not be worth the effort.
Hybrid Approaches
Many systems use both:
Training in floating-point, inference in fixed-point. Research and development use convenient floating-point. Deployment converts to deterministic fixed-point. This is the approach of certifiable-quant.
Critical path in fixed-point, peripherals in floating-point. The safety-critical control loop uses deterministic fixed-point. Logging, visualisation, and non-critical features use floating-point.
Fixed-point with floating-point validation. Implement in fixed-point, validate against a floating-point reference to catch overflow issues.
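A sketch of the validation idea: run the fixed-point implementation alongside a double-precision reference and flag any divergence beyond a tolerance expressed in Q16.16 LSBs (the helper name and the tolerance value are illustrative):
#include <math.h>

// Returns 1 if the fixed-point result is within tolerance_lsbs LSBs of the reference.
// One Q16.16 LSB is 1/65,536.
int q16_matches_reference(q16_t fixed_result, double reference, int tolerance_lsbs) {
    double as_double = (double)fixed_result / 65536.0;
    double error_lsbs = fabs(as_double - reference) * 65536.0;
    return error_lsbs <= (double)tolerance_lsbs;
}
Ordinary rounding shows up as an error of a few LSBs; a large discrepancy usually points at overflow or a format mix-up.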
Fixed-Point in Practice
Neural Network Inference
The certifiable-inference project demonstrates fixed-point neural networks. The key insight: neural network weights and activations are typically small values (after normalisation), fitting comfortably in Q16.16 range.
// Convolution in Q16.16 (width is the input row stride)
q16_t conv2d_output(const q16_t *input, int width, const q16_t *kernel,
                    int x, int y, int kw, int kh) {
    int64_t sum = 0; // Accumulate in 64-bit to avoid overflow
    for (int ky = 0; ky < kh; ky++) {
        for (int kx = 0; kx < kw; kx++) {
            int64_t in = input[(y + ky) * width + (x + kx)];
            int64_t wt = kernel[ky * kw + kx];
            sum += in * wt;
        }
    }
    // Scale back to Q16.16 with RNE
    return q16_from_i64_rne(sum >> 16);
}
Accumulating in 64-bit prevents overflow during the convolution. The final shift returns the accumulator to Q16.16 format.
For more on the mathematics, see Fixed-Point Neural Networks: The Math Behind Q16.16.
Safety Monitors
The c-from-scratch modules support fixed-point for embedded deployment:
// EMA update in Q16.16
void baseline_update_q16(baseline_q16_t *ctx, q16_t value) {
// ema_new = alpha * value + (1 - alpha) * ema_old
q16_t weighted_value = q16_mul(ctx->alpha, value);
q16_t weighted_ema = q16_mul(ctx->one_minus_alpha, ctx->ema);
ctx->ema = q16_add(weighted_value, weighted_ema);
}
The same algorithm, deterministic across all platforms, no FPU required.
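Setting up the context might look like this, assuming the field names used in the snippet above (the actual baseline_q16_t layout in c-from-scratch may differ):
typedef struct {
    q16_t alpha;             // EMA smoothing factor, 0 < alpha < 1, in Q16.16
    q16_t one_minus_alpha;   // precomputed (1.0 - alpha) so the hot path avoids a subtraction
    q16_t ema;               // current estimate
} baseline_q16_t;

void baseline_init_q16(baseline_q16_t *ctx, q16_t alpha, q16_t initial_value) {
    ctx->alpha = alpha;
    ctx->one_minus_alpha = Q16_ONE - alpha;   // same scale, plain integer subtraction
    ctx->ema = initial_value;
}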
Financial Calculations
For financial applications, larger integer parts may be needed:
// Q40.24 for large currency values with high precision
typedef int64_t money_t; // 40 integer bits, 24 fractional
#define MONEY_SCALE (1LL << 24)
#define MONEY_MAX (INT64_MAX >> 24) // ≈ 550 billion
money_t money_mul(money_t a, money_t b) {
// Use 128-bit intermediate (compiler extension or manual)
__int128 product = (__int128)a * b;
return (money_t)(product >> 24);
}
This handles values up to hundreds of billions with sub-cent precision - adequate for most financial applications while remaining deterministic.
Common Pitfalls
Silent Overflow
The most dangerous fixed-point bug:
q16_t large = 30000 << 16; // 30,000.0
q16_t result = q16_mul(large, large); // Overflow!
30,000 × 30,000 = 900,000,000, which exceeds Q16.16’s range of ±32,767. The result wraps to garbage.
Solution: Use saturation arithmetic or validate ranges before operations.
Precision Loss Accumulation
Each multiplication loses precision to rounding. In long computation chains, errors accumulate:
q16_t x = Q16_ONE; // 1.0
for (int i = 0; i < 1000; i++) {
x = q16_mul(x, Q16_FROM_FLOAT(0.999f)); // Tiny precision loss each iteration
}
// Final x may differ from expected by multiple LSBs
Solution: Use higher-precision intermediate formats, or analyse error bounds.
Division Quirks
Fixed-point division has surprising behaviour near zero:
q16_t small = 1; // 0.000015 (smallest positive value)
q16_t result = q16_div(Q16_ONE, small); // 1.0 / 0.000015 = 65,536
// But 65,536 > 32,767, so overflow!
Solution: Guard against division by small values, or use saturation.
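A sketch of a division that guards both hazards - a zero divisor and a quotient outside the Q16.16 range (the saturation policy shown is one reasonable choice, not the only one):
q16_t q16_div_saturate(q16_t a, q16_t b) {
    if (b == 0) {
        // Define the failure mode explicitly: saturate in the direction of the numerator
        return (a >= 0) ? (q16_t)INT32_MAX : (q16_t)INT32_MIN;
    }
    int64_t quotient = ((int64_t)a << 16) / b;            // same scaling as q16_div above
    if (quotient > INT32_MAX) return (q16_t)INT32_MAX;    // e.g. 1.0 / 0.000015
    if (quotient < INT32_MIN) return (q16_t)INT32_MIN;
    return (q16_t)quotient;
}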
Mixing Formats
Accidentally mixing Q8.24 and Q16.16 values produces garbage:
q8_24_t a = /* Q8.24 value */;
q16_16_t b = /* Q16.16 value */;
q16_16_t c = a + b; // Wrong! Different scales!
Solution: Use strong types and conversion functions. Consider the approach in Fixed-Point Fundamentals.
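A sketch of explicit conversion helpers - the two scales differ by a factor of 2^8, so each conversion is a shift (names illustrative):
typedef int32_t q8_24_t;     // Q8.24: scale 2^24
typedef int32_t q16_16_t;    // Q16.16: scale 2^16

// Q8.24 -> Q16.16: drop 8 fraction bits (loses precision, cannot overflow)
static inline q16_16_t q16_16_from_q8_24(q8_24_t x) {
    return (q16_16_t)(x >> 8);
}

// Q16.16 -> Q8.24: gain 8 fraction bits (overflows if |value| >= 128, so range-check first)
static inline q8_24_t q8_24_from_q16_16(q16_16_t x) {
    return (q8_24_t)((int64_t)x << 8);
}
In C a typedef alone will not stop accidental mixing; wrapping each format in a distinct single-member struct, so the compiler rejects q8_24_t + q16_16_t outright, is the stronger version of the same idea.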
Conclusion
Fixed-point arithmetic isn’t universally better than floating-point. It’s a trade-off:
Fixed-point offers: Determinism, reproducibility, analysability, no FPU requirement.
Fixed-point costs: Limited range, manual overflow handling, more development effort, less library support.
For safety-critical systems where certification requires deterministic behaviour and reproducible results, fixed-point is often the right choice. The costs are real but manageable, and the benefits - proving that the same code produces the same results everywhere - are essential.
For research, prototyping, and non-critical applications, floating-point’s convenience usually wins. The non-determinism rarely matters in practice.
The key is making a conscious choice based on requirements, not defaulting to floating-point because it’s familiar. When determinism matters more than dynamic range, fixed-point delivers.
The Fixed-Point Fundamentals course provides a complete introduction to fixed-point arithmetic. The certifiable-* ecosystem demonstrates production use in ML pipelines.
As with any engineering decision, context matters. Know what you need, understand the trade-offs, and choose accordingly.