The certifiable-* ecosystem uses Q16.16 fixed-point arithmetic throughout. This isn’t because floating-point is inherently evil - it’s because the specific properties of fixed-point arithmetic align with the specific requirements of safety-critical systems.
But fixed-point has real costs. Limited range. Manual overflow handling. More development effort. For many applications, these costs outweigh the benefits.
This article provides an honest framework for deciding when fixed-point is worth it.
What Fixed-Point Actually Is
Fixed-point arithmetic represents fractional numbers using integers with an implicit scale factor. In Q16.16 format:
Raw integer value: 0x00018000 (98,304)
Interpreted value: 98,304 / 65,536 = 1.5
The “Q16.16” notation means 16 bits for the integer part and 16 bits for the fractional part. Stored in a 32-bit signed integer, a Q16.16 value ranges from approximately -32,768 to +32,767, with a precision of 1/65,536 ≈ 0.000015.
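The code snippets throughout this article use a q16_t type, along with Q16_ONE and Q16_FROM_FLOAT in later examples. A minimal sketch of the definitions they assume (the exact definitions in the certifiable-* libraries may differ):
#include <stdint.h>

typedef int32_t q16_t;                 // Q16.16 value: raw integer with an implicit /65,536 scale

#define Q16_ONE  ((q16_t)(1 << 16))    // 1.0 in Q16.16 (raw value 65,536)
#define Q16_FROM_FLOAT(x) ((q16_t)((x) * 65536.0f))  // illustrative conversion; truncates toward zero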
Basic Operations
Addition and subtraction work directly on the raw integers:
// Q16.16 addition: just integer addition
q16_t q16_add(q16_t a, q16_t b) {
return a + b; // Same scale factor, direct addition
}
// Example: 1.5 + 2.25 = 3.75
// 98,304 + 147,456 = 245,760
// 245,760 / 65,536 = 3.75 ✓
Multiplication requires a shift to maintain the scale:
// Q16.16 multiplication
q16_t q16_mul(q16_t a, q16_t b) {
int64_t product = (int64_t)a * (int64_t)b;
return (q16_t)(product >> 16); // Divide by scale factor
}
// Example: 1.5 × 2.0 = 3.0
// 98,304 × 131,072 = 12,884,901,888
// 12,884,901,888 >> 16 = 196,608
// 196,608 / 65,536 = 3.0 ✓
Division is the inverse:
// Q16.16 division
q16_t q16_div(q16_t a, q16_t b) {
int64_t scaled = (int64_t)a << 16; // Multiply by scale factor
return (q16_t)(scaled / b);
}
These operations are pure integer arithmetic. No floating-point unit required. No hardware-dependent rounding modes.
Why Floating-Point Is Non-Deterministic
IEEE 754 floating-point seems deterministic - the standard specifies exact behaviour. But the specification has gaps, and hardware implementations differ.
Rounding Mode Variability
The IEEE 754 standard defines five rounding modes:
- Round to nearest, ties to even (default)
- Round to nearest, ties away from zero
- Round toward zero
- Round toward positive infinity
- Round toward negative infinity
All mainstream hardware defaults to “round to nearest, ties to even” for binary formats, but the active mode is mutable process state rather than a property of your source code: a library call, a runtime, or a non-default build configuration can switch it, and embedded toolchains sometimes enable non-IEEE behaviour such as flush-to-zero handling of denormals. Same computation, different environment, different results.
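A small illustration of this environment dependence, using the standard <fenv.h> interface (a sketch: FE_UPWARD is an optional macro, and some compilers need a strict floating-point mode before they fully honour fesetround):
#include <fenv.h>
#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 3.0;   // volatile: discourage constant folding

    fesetround(FE_TONEAREST);
    double nearest = a / b;             // 1/3 rounded to nearest

    fesetround(FE_UPWARD);
    double upward = a / b;              // 1/3 rounded toward +infinity: one ulp larger

    printf("same result? %d\n", nearest == upward);  // prints 0 where the mode is honoured
    fesetround(FE_TONEAREST);           // restore the default
    return 0;
}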
Extended Precision
x87 floating-point (used on older x86 processors) uses 80-bit extended precision internally, then rounds to 64-bit for storage. This means:
double a = /* some value */;
double b = /* some value */;
double c = a * b; // May stay in an 80-bit register, or be rounded to 64-bit if spilled
// vs.
double temp = a * b; // Intermediate may be spilled to memory (and rounded) or kept in a register
double c = temp; // May or may not reload the rounded value from memory
The compiler’s register allocation decisions affect numerical results. Same source code, different binaries, different results.
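Whether a platform evaluates double expressions in a wider internal format is advertised by the standard FLT_EVAL_METHOD macro from <float.h>; a quick way to check what your build actually does:
#include <float.h>
#include <stdio.h>

int main(void) {
    // 0: evaluate in the declared type (SSE2 and most modern targets)
    // 1: evaluate float expressions as double
    // 2: evaluate internally in long double (classic x87 behaviour)
    // negative values: indeterminable
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}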
Fused Multiply-Add
Modern processors support FMA (fused multiply-add) instructions that compute a * b + c with a single rounding at the end, rather than rounding after the multiply and again after the add.
double fma_result = a * b + c; // Might use FMA (one rounding)
double separate = (a * b) + c; // Might not (two roundings)
Whether the compiler uses FMA depends on optimisation level, target architecture, and compiler version. Same expression, different binary, different result.
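The contrast can be made explicit with the C99 fma() function, which always applies a single rounding; whether the plain expression gets fused is the compiler's choice. The values below are chosen purely to make the difference visible:
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double a = 1.0 + 0x1p-27;   // 1 + 2^-27, exactly representable
    volatile double b = 1.0 - 0x1p-27;   // 1 - 2^-27, exactly representable
    volatile double c = -1.0;

    double fused    = fma(a, b, c);      // one rounding: yields -2^-54 exactly
    double separate = (a * b) + c;       // two roundings: a*b rounds to 1.0, so this is 0.0
                                         // (unless the compiler contracts it into an FMA anyway)

    printf("fused=%g separate=%g\n", fused, separate);
    return 0;
}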
Associativity Failures
Floating-point addition is not associative:
double a = 1e-16;
double b = 1.0;
double c = -1.0;
double x = (a + b) + c; // = 0.0 (1e-16 absorbed by 1.0)
double y = a + (b + c); // = 1e-16 (b + c = 0, then add a)
When compilers reorder operations for optimisation, results change. Parallel reduction (summing array elements across multiple threads) is particularly vulnerable - the order of operations affects the result.
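A small illustration of the reduction-order problem - the same four values summed left-to-right versus pairwise (the grouping a two-thread reduction might use) give different answers:
#include <stdio.h>

int main(void) {
    double data[4] = {1e16, 1.0, -1e16, 1.0};

    // Left-to-right: the first 1.0 is absorbed by 1e16, the last one survives
    double sequential = ((data[0] + data[1]) + data[2]) + data[3];   // 1.0

    // Pairwise, as a parallel reduction might group the work
    double pairwise = (data[0] + data[1]) + (data[2] + data[3]);     // 0.0

    printf("sequential=%g pairwise=%g\n", sequential, pairwise);
    return 0;
}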
The Practical Impact
For most applications, these differences are negligible - they affect the least significant bits. But for safety-critical systems:
- Different platforms produce different results
- The same platform may produce different results across compiler versions
- Test results on development machines don’t guarantee production behaviour
- Debugging is harder when results aren’t reproducible
This is why fixed-point matters.
Why Fixed-Point Is Deterministic
Fixed-point arithmetic uses only integer operations, which are fully deterministic across all platforms:
// Integer addition: always the same
int32_t a = 100;
int32_t b = 200;
int32_t c = a + b; // Always 300, everywhere, forever
Deterministic Rounding
The one place where fixed-point requires decisions is the shift in multiplication:
int64_t product = (int64_t)a * (int64_t)b;
return (q16_t)(product >> 16); // Truncation
A plain arithmetic right shift truncates toward negative infinity. This is deterministic, but it introduces a systematic downward bias. The solution is explicit round-to-nearest-even (RNE):
q16_t q16_mul_rne(q16_t a, q16_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    // Round to nearest, ties to even
    int64_t lsb    = (product >> 16) & 1;       // least significant bit that will be kept
    int64_t guard  = (product >> 15) & 1;       // most significant bit being discarded
    int64_t sticky = (product & 0x7FFF) != 0;   // any other discarded bit set
    int64_t round_up = guard & (sticky | lsb);  // above half: always round up; exactly half: only if the kept LSB is odd
    return (q16_t)((product >> 16) + round_up);
}
This is more code than truncation, but it’s pure integer arithmetic, fully deterministic, and produces the same result on every platform.
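To see the tie rule in action, consider a product whose discarded bits are exactly one half (raw values chosen for illustration):
// a = 0x0003 (3/65,536), b = 0x8000 (0.5)
// product            = 0x00018000 -> the discarded low 16 bits are exactly 0x8000 (one half)
// truncation (>> 16) = 1
// RNE: the candidates are 1 and 2; the tie goes to the even one, so the result is 2
q16_t r = q16_mul_rne(0x0003, 0x8000);   // r == 2, i.e. 2/65,536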
For details on why RNE matters, see Round-to-Nearest-Even: The Rounding Mode That Makes Determinism Possible.
No Special Values
Floating-point has special values: positive/negative infinity, positive/negative zero, NaN (Not a Number), denormalised numbers. Each creates edge cases:
double x = 0.0 / 0.0; // NaN
double y = 1.0 / 0.0; // Infinity
int z = (x == x); // 0 (false): NaN compares unequal, even to itself
Fixed-point has none of this. Overflow is a real hazard, but it is one you can detect and handle explicitly (see below); division by zero is undefined behaviour that a simple guard prevents. There’s no NaN to propagate silently through computations.
The Range-Precision Trade-off
Fixed-point’s determinism comes at a cost: limited range.
Q16.16 Limits
Maximum value: 32,767.999985 (0x7FFFFFFF)
Minimum value: -32,768.0 (0x80000000)
Precision: 0.000015 (1/65,536)
For neural network weights typically in the range [-2, 2], this is ample precision. For financial calculations involving millions of dollars, it’s insufficient.
Choosing the Format
Different applications need different trade-offs:
| Format | Integer Bits | Fraction Bits | Range | Precision |
|---|---|---|---|---|
| Q8.24 | 8 | 24 | ±127 | 0.00000006 |
| Q16.16 | 16 | 16 | ±32,767 | 0.000015 |
| Q24.8 | 24 | 8 | ±8,388,607 | 0.004 |
| Q31.1 | 31 | 1 | ±1,073,741,823 | 0.5 |
High precision, low range (Q8.24): Signal processing, audio, small neural networks with normalised weights.
Balanced (Q16.16): General-purpose, neural network inference, control systems.
High range, low precision (Q24.8): Financial calculations with large integers, counters, timestamps.
The format choice is a design decision that must be made upfront and documented.
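One way to see the format as a design parameter: every operation depends on the same scale constant, so a generic multiply differs between formats only in the shift amount (an illustrative helper, not part of any particular library):
// Generic Qm.n multiply for 32-bit formats. frac_bits is fixed at design time:
// 24 for Q8.24, 16 for Q16.16, 8 for Q24.8. Truncating variant for brevity.
static inline int32_t q_mul(int32_t a, int32_t b, int frac_bits) {
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> frac_bits);
}
In practice the shift amount is a compile-time constant rather than a runtime parameter, which keeps the chosen format explicit and easy to audit.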
Overflow Handling
Fixed-point overflow wraps silently on two’s-complement hardware (and in C, signed overflow is formally undefined behaviour, which is worse still):
q16_t max = 0x7FFFFFFF; // 32,767.999985
q16_t result = max + 1; // 0x80000000 = -32,768.0 (wrapped!)
This must be handled explicitly:
q16_t q16_add_saturate(q16_t a, q16_t b) {
int64_t sum = (int64_t)a + (int64_t)b;
if (sum > INT32_MAX) return INT32_MAX;
if (sum < INT32_MIN) return INT32_MIN;
return (q16_t)sum;
}
Saturation arithmetic clamps to the maximum/minimum rather than wrapping. This is often the right behaviour for control systems - better to hit a limit than wrap to the opposite extreme.
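The same pattern extends to multiplication; a sketch of a saturating Q16.16 multiply (truncating the fraction for brevity):
q16_t q16_mul_saturate(q16_t a, q16_t b) {
    int64_t product = ((int64_t)a * (int64_t)b) >> 16;   // scale back to Q16.16 in 64 bits
    if (product > INT32_MAX) return (q16_t)INT32_MAX;    // clamp at +32,767.999985
    if (product < INT32_MIN) return (q16_t)INT32_MIN;    // clamp at -32,768.0
    return (q16_t)product;
}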
The Decision Framework
Here’s when to choose fixed-point:
Use Fixed-Point When:
Determinism is required. Safety-critical systems, certified software, reproducible research, bit-exact testing across platforms.
Certification is involved. DO-178C, IEC 62304, ISO 26262 all favour analysable arithmetic. Fixed-point’s bounded, deterministic behaviour is easier to certify.
No FPU is available. Embedded systems, microcontrollers, and some safety-critical processors lack floating-point hardware.
Cross-platform reproducibility matters. If test results must match production exactly, floating-point is risky.
Values are bounded. If you know your values stay within ±32,767 (or whatever format you choose), fixed-point works well.
Use Floating-Point When:
Dynamic range is large. Values spanning 10⁻³⁰ to 10³⁰ require floating-point’s exponential representation.
Precision varies by magnitude. Floating-point maintains relative precision across scales; fixed-point has constant absolute precision.
Performance on GPU matters. GPU floating-point is highly optimised; fixed-point often requires workarounds.
The application isn’t safety-critical. For research, prototyping, and non-critical applications, floating-point’s convenience outweighs its non-determinism.
Library support is needed. Scientific computing libraries assume floating-point. Translating to fixed-point may not be worth the effort.
Hybrid Approaches
Many systems use both:
Training in floating-point, inference in fixed-point. Research and development use convenient floating-point. Deployment converts to deterministic fixed-point. This is the approach of certifiable-quant.
Critical path in fixed-point, peripherals in floating-point. The safety-critical control loop uses deterministic fixed-point. Logging, visualisation, and non-critical features use floating-point.
Fixed-point with floating-point validation. Implement in fixed-point, validate against a floating-point reference to catch overflow issues.
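A sketch of the validation idea: run the fixed-point implementation alongside a double-precision reference and flag any divergence beyond a tolerance expressed in Q16.16 LSBs (the helper name and the tolerance value are illustrative):
#include <math.h>

// Returns 1 if the fixed-point result is within tolerance_lsbs LSBs of the reference.
// One Q16.16 LSB is 1/65,536.
int q16_matches_reference(q16_t fixed_result, double reference, int tolerance_lsbs) {
    double as_double = (double)fixed_result / 65536.0;
    double error_lsbs = fabs(as_double - reference) * 65536.0;
    return error_lsbs <= (double)tolerance_lsbs;
}
Ordinary rounding shows up as an error of a few LSBs; a large discrepancy usually points at overflow or a format mix-up.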
Fixed-Point in Practice
Neural Network Inference
The certifiable-inference project demonstrates fixed-point neural networks. The key insight: neural network weights and activations are typically small values (after normalisation), fitting comfortably in Q16.16 range.
// Convolution in Q16.16 (width is the input row stride)
q16_t conv2d_output(const q16_t *input, int width, const q16_t *kernel,
                    int x, int y, int kw, int kh) {
    int64_t sum = 0; // Accumulate in 64-bit to avoid overflow
    for (int ky = 0; ky < kh; ky++) {
        for (int kx = 0; kx < kw; kx++) {
            int64_t in = input[(y + ky) * width + (x + kx)];
            int64_t wt = kernel[ky * kw + kx];
            sum += in * wt;
        }
    }
    // Scale back to Q16.16 with RNE
    return q16_from_i64_rne(sum >> 16);
}
Accumulating in 64-bit prevents overflow during the convolution. The final shift returns the accumulator to Q16.16 format.
For more on the mathematics, see Fixed-Point Neural Networks: The Math Behind Q16.16.
Safety Monitors
The c-from-scratch modules support fixed-point for embedded deployment:
// EMA update in Q16.16
void baseline_update_q16(baseline_q16_t *ctx, q16_t value) {
// ema_new = alpha * value + (1 - alpha) * ema_old
q16_t weighted_value = q16_mul(ctx->alpha, value);
q16_t weighted_ema = q16_mul(ctx->one_minus_alpha, ctx->ema);
ctx->ema = q16_add(weighted_value, weighted_ema);
}
The same algorithm, deterministic across all platforms, no FPU required.
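Setting up the context might look like this, assuming the field names used in the snippet above (the actual baseline_q16_t layout in c-from-scratch may differ):
typedef struct {
    q16_t alpha;             // EMA smoothing factor, 0 < alpha < 1, in Q16.16
    q16_t one_minus_alpha;   // precomputed (1.0 - alpha) so the hot path avoids a subtraction
    q16_t ema;               // current estimate
} baseline_q16_t;

void baseline_init_q16(baseline_q16_t *ctx, q16_t alpha, q16_t initial_value) {
    ctx->alpha = alpha;
    ctx->one_minus_alpha = Q16_ONE - alpha;   // same scale, plain integer subtraction
    ctx->ema = initial_value;
}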
Financial Calculations
For financial applications, larger integer parts may be needed:
// Q40.24 for large currency values with high precision
typedef int64_t money_t; // 40 integer bits, 24 fractional
#define MONEY_SCALE (1LL << 24)
#define MONEY_MAX (INT64_MAX >> 24) // ≈ 550 billion
money_t money_mul(money_t a, money_t b) {
// Use 128-bit intermediate (compiler extension or manual)
__int128 product = (__int128)a * b;
return (money_t)(product >> 24);
}
This handles values up to hundreds of billions with sub-cent precision - adequate for most financial applications while remaining deterministic.
Common Pitfalls
Silent Overflow
The most dangerous fixed-point bug:
q16_t large = 30000 << 16; // 30,000.0
q16_t result = q16_mul(large, large); // Overflow!
30,000 × 30,000 = 900,000,000, which exceeds Q16.16’s range of ±32,767. The result wraps to garbage.
Solution: Use saturation arithmetic or validate ranges before operations.
Precision Loss Accumulation
Each multiplication loses precision to rounding. In long computation chains, errors accumulate:
q16_t x = Q16_ONE; // 1.0
for (int i = 0; i < 1000; i++) {
x = q16_mul(x, Q16_FROM_FLOAT(0.999f)); // Tiny precision loss each iteration
}
// Final x may differ from expected by multiple LSBs
Solution: Use higher-precision intermediate formats, or analyse error bounds.
Division Quirks
Fixed-point division has surprising behaviour near zero:
q16_t small = 1; // 0.000015 (smallest positive value)
q16_t result = q16_div(Q16_ONE, small); // 1.0 / 0.000015 = 65,536
// But 65,536 > 32,767, so overflow!
Solution: Guard against division by small values, or use saturation.
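A sketch of a division that guards both hazards - a zero divisor and a quotient outside the Q16.16 range (the saturation policy shown is one reasonable choice, not the only one):
q16_t q16_div_saturate(q16_t a, q16_t b) {
    if (b == 0) {
        // Define the failure mode explicitly: saturate in the direction of the numerator
        return (a >= 0) ? (q16_t)INT32_MAX : (q16_t)INT32_MIN;
    }
    int64_t quotient = ((int64_t)a << 16) / b;            // same scaling as q16_div above
    if (quotient > INT32_MAX) return (q16_t)INT32_MAX;    // e.g. 1.0 / 0.000015
    if (quotient < INT32_MIN) return (q16_t)INT32_MIN;
    return (q16_t)quotient;
}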
Mixing Formats
Accidentally mixing Q8.24 and Q16.16 values produces garbage:
q8_24_t a = /* Q8.24 value */;
q16_16_t b = /* Q16.16 value */;
q16_16_t c = a + b; // Wrong! Different scales!
Solution: Use strong types and conversion functions. Consider the approach in Fixed-Point Fundamentals.
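A sketch of explicit conversion helpers - the two scales differ by a factor of 2^8, so each conversion is a shift (names illustrative):
typedef int32_t q8_24_t;     // Q8.24: scale 2^24
typedef int32_t q16_16_t;    // Q16.16: scale 2^16

// Q8.24 -> Q16.16: drop 8 fraction bits (loses precision, cannot overflow)
static inline q16_16_t q16_16_from_q8_24(q8_24_t x) {
    return (q16_16_t)(x >> 8);
}

// Q16.16 -> Q8.24: gain 8 fraction bits (overflows if |value| >= 128, so range-check first)
static inline q8_24_t q8_24_from_q16_16(q16_16_t x) {
    return (q8_24_t)((int64_t)x << 8);
}
In C a typedef alone will not stop accidental mixing; wrapping each format in a distinct single-member struct, so the compiler rejects q8_24_t + q16_16_t outright, is the stronger version of the same idea.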
Conclusion
Fixed-point arithmetic isn’t universally better than floating-point. It’s a trade-off:
Fixed-point offers: Determinism, reproducibility, analysability, no FPU requirement.
Fixed-point costs: Limited range, manual overflow handling, more development effort, less library support.
For safety-critical systems where certification requires deterministic behaviour and reproducible results, fixed-point is often the right choice. The costs are real but manageable, and the benefits - proving that the same code produces the same results everywhere - are essential.
For research, prototyping, and non-critical applications, floating-point’s convenience usually wins. The non-determinism rarely matters in practice.
The key is making a conscious choice based on requirements, not defaulting to floating-point because it’s familiar. When determinism matters more than dynamic range, fixed-point delivers.
The Fixed-Point Fundamentals course provides a complete introduction to fixed-point arithmetic. The certifiable-* ecosystem demonstrates production use in ML pipelines.
As with any engineering decision, context matters. Know what you need, understand the trade-offs, and choose accordingly.