Most engineers learn floating-point arithmetic and never question it. IEEE 754 is convenient, widely supported, and “good enough” for most applications.
Until it isn’t.
When you need deterministic results — the same output for the same input, every time, on every platform — floating-point becomes a liability. When certification bodies ask you to prove your arithmetic is bounded, floating-point makes that proof difficult. When accumulated rounding errors cause your control system to drift, floating-point is the culprit.
Fixed-point arithmetic solves these problems. But most engineers never learned it properly.
## The Problem with Floating-Point
Floating-point arithmetic has three fundamental issues for safety-critical systems:
### 1. Non-Determinism Across Platforms
The same floating-point code can produce different results on different hardware:
```c
// This may give different answers on x87 vs SSE vs ARM
float result = a * b + c * d;
```

The x87 FPU uses 80-bit extended precision internally. SSE uses 64-bit. ARM has its own quirks. Compiler flags like `-ffast-math` change behaviour. The “same” computation isn’t the same at all.
### 2. Accumulation Drift
Small rounding errors compound over time:
```c
float sum = 0.0f;
for (int i = 0; i < 8640000; i++) { // 24 hours at 100 Hz
    sum += 0.01f;
}
// Expected: 86400.0
// Actual:   ~87296.4 (error: ~1%)
```

In a control system running for hours or days, this drift can cause real problems. The Patriot missile failure in 1991 was caused by exactly this kind of accumulated error — a 0.000000095-second truncation error per 0.1-second clock tick that, over 100 hours, compounded into a 573-metre targeting error. Twenty-eight soldiers died.
### 3. Certification Challenges
Safety standards like DO-178C (aerospace), IEC 62304 (medical devices), and ISO 26262 (automotive) require you to prove bounds on your computations. With floating-point, proving worst-case behaviour is complex. The IEEE 754 standard alone runs to dozens of pages of edge cases.
## The Fixed-Point Solution
Fixed-point arithmetic uses integers with an implicit scale factor. A Q16.16 number, for example, uses 32 bits: 16 for the integer part, 16 for the fractional part. The scale factor is 2^16 = 65536.
```c
// Q16.16: 16 integer bits, 16 fractional bits
typedef int32_t q16_16_t;
#define Q16_16_SCALE 65536

// Convert 3.14159 to Q16.16
q16_16_t pi = (q16_16_t)(3.14159 * Q16_16_SCALE); // = 205887
// The value 205887 represents 205887/65536 = 3.14158630...
```

This representation gives you:
- Determinism — integer arithmetic is identical on every platform
- Bounded precision — you know exactly how precise your numbers are (1/65536 ≈ 0.000015 for Q16.16)
- Bounded range — you know exactly what values you can represent (−32768.0 to +32767.99998 for Q16.16)
- No special hardware — every CPU handles integers identically
The trade-off is that you must choose your format carefully. Range and precision are in tension — more bits for the integer part means fewer bits for the fractional part.
## Course Structure
Fixed-Point Fundamentals teaches this material systematically, from motivation through practical application:
| Lesson | Topic | What You’ll Learn |
|---|---|---|
| 00 | Why Not Float? | Real failures, platform divergence, accumulation drift |
| 01 | The Model | Q notation, implicit scaling, range vs precision trade-off |
| 02 | Arithmetic | The widening pattern: widen → compute → narrow |
| 03 | Safety | Saturation logic, sticky fault flags, avoiding undefined behaviour |
| 04 | Rounding | Truncation vs RNE, eliminating statistical bias |
| 05 | Conversion | Format rescaling, precision loss analysis |
| 06 | Patterns | Accumulators, lookup tables, mixed-precision PID |
| 07 | Strategy | Decision frameworks, format selection, certification bridge |
Each lesson includes working C99 code you can compile and run immediately.
## The Core Pattern: Widen → Compute → Narrow
The most important technique in fixed-point arithmetic is the widening pattern. When you multiply two Q16.16 numbers, the intermediate result needs 64 bits to avoid overflow:
```c
q16_16_t q16_mul(q16_16_t a, q16_16_t b) {
    // Widen to 64 bits for the intermediate
    int64_t wide = (int64_t)a * (int64_t)b;
    // The product has 32 fractional bits (16 + 16).
    // Add half an output LSB, then shift right by 16 to get
    // back to Q16.16 (round-half-up; Lesson 04 refines this to RNE)
    wide += (int64_t)1 << 15;
    return (q16_16_t)(wide >> 16);
}
```

Without widening, the multiplication would overflow silently. With widening, you have room for the full result before narrowing back to your target format.
This pattern — widen, compute, narrow — appears everywhere in fixed-point code. Master it and you’ve mastered half of fixed-point arithmetic.
## Rounding: Why It Matters More Than You Think
When you narrow a result, you lose precision. How you handle that loss matters:
Truncation (round toward zero) introduces systematic bias. If you truncate repeatedly, errors accumulate in one direction.
Round-half-up (school rounding) has the same problem — it biases toward positive infinity.
Round-to-nearest-even (banker’s rounding) is statistically unbiased. When the value is exactly halfway between two representable numbers, it rounds to the nearest even number. Over many operations, the positive and negative roundings cancel out.
```c
// Round-to-nearest-even for Q16.16 division
q16_16_t q16_div_rne(q16_16_t a, q16_16_t b) {
    int64_t wide = (int64_t)a << 16;
    int64_t quotient  = wide / b;  // truncates toward zero in C99
    int64_t remainder = wide % b;  // takes the sign of the dividend
    // Compare |remainder| against |b|/2 so negative operands round
    // correctly too; on an exact tie, round to the even quotient
    int64_t abs_rem = remainder < 0 ? -remainder : remainder;
    int64_t abs_b   = b < 0 ? -(int64_t)b : (int64_t)b;
    if (2 * abs_rem > abs_b ||
        (2 * abs_rem == abs_b && (quotient & 1))) {
        quotient += ((a < 0) != (b < 0)) ? -1 : 1; // away from zero
    }
    return (q16_16_t)quotient;
}
```

Lesson 04 demonstrates this with a 1-million-operation test. Truncation drifts to zero. Round-half-up drifts positive. RNE stays centred.
## Overflow: The Silent Killer
In C, signed integer overflow is undefined behaviour. The compiler is free to assume it never happens, which can lead to surprising optimisations that break your code.
Fixed-point code must handle overflow explicitly:
```c
typedef struct {
    q16_16_t value;
    uint8_t  flags;  // Sticky fault flags
} q16_16_result_t;

#define FAULT_OVERFLOW  0x01
#define FAULT_UNDERFLOW 0x02
#define FAULT_SATURATED 0x04

q16_16_result_t q16_add_safe(q16_16_t a, q16_16_t b, uint8_t *flags) {
    int64_t wide = (int64_t)a + (int64_t)b;
    q16_16_result_t result;
    if (wide > INT32_MAX) {
        result.value = INT32_MAX;   // Saturate high
        *flags |= FAULT_OVERFLOW | FAULT_SATURATED;
    } else if (wide < INT32_MIN) {
        result.value = INT32_MIN;   // Saturate low
        *flags |= FAULT_UNDERFLOW | FAULT_SATURATED;
    } else {
        result.value = (q16_16_t)wide;
    }
    result.flags = *flags;  // Carry the accumulated flags with the result
    return result;
}
```

The sticky fault flags pattern is crucial for safety-critical systems. You don’t check for overflow on every operation (too expensive). Instead, you clear the flags at the start of a computation pipeline, let them accumulate, and check once at the end. If any overflow occurred, you know about it.
Lesson 03 demonstrates this with a PID controller that experiences integral windup. Without saturation, the output wraps negative and the controller violently reverses — potentially causing physical damage in a real system.
## Practical Patterns
Lesson 06 brings everything together with patterns you’ll use in real systems:
### Mixed-Precision PID Controller
Different parts of a PID controller have different requirements:
```c
typedef struct {
    q8_24_t  kp, ki, kd;   // Coefficients: high precision, small range
    q32_32_t integral;     // State: wide range for accumulation
    q16_16_t last_error;   // State: general purpose
} pid_controller_t;

q16_16_t pid_update(pid_controller_t *pid, q16_16_t error, q16_16_t dt) {
    // Proportional term
    q32_32_t p_term = q_mul_q8_24_q16_16(pid->kp, error);

    // Integral term (accumulate in Q32.32 to prevent overflow)
    pid->integral = q32_add(pid->integral,
                            q_mul_q8_24_q16_16(pid->ki, q_mul(error, dt)));

    // Derivative term
    q16_16_t derivative = q_div(q_sub(error, pid->last_error), dt);
    q32_32_t d_term = q_mul_q8_24_q16_16(pid->kd, derivative);
    pid->last_error = error;

    // Sum and convert to output format
    q32_32_t output = q32_add(q32_add(p_term, pid->integral), d_term);
    return q32_to_q16(output);  // Saturate if needed
}
```

Coefficients use Q8.24 (small values, high precision). The integral accumulator uses Q32.32 (wide range, prevents overflow). Inputs and outputs use Q16.16 (general-purpose interface).
### Sine Lookup Table with Linear Interpolation
When you can’t afford the cycles for CORDIC or polynomial approximation:
```c
// 256-entry quarter-wave table (257 entries so index + 1 is always valid)
static const q16_16_t sine_table[257] = {
    0x00000000, 0x00000648, 0x00000C8F, /* ... */
};

q16_16_t q16_sin(q16_16_t angle) {
    // Reduce to [0, 2π)
    angle = angle & 0x0000FFFF;  // Assuming 2π = 0x10000

    // Determine quadrant, table index, and interpolation fraction
    uint32_t quadrant = (angle >> 14) & 0x3;
    uint32_t index    = (angle >> 6) & 0xFF;
    uint32_t frac     = angle & 0x3F;

    // Lookup with linear interpolation
    q16_16_t y0 = sine_table[index];
    q16_16_t y1 = sine_table[index + 1];
    q16_16_t result = y0 + (((y1 - y0) * frac) >> 6);

    // Apply quadrant symmetry
    // ...
    return result;
}
```

A 257-entry table (about 1 KB) gives you better than 16-bit precision with simple linear interpolation. No floating-point transcendentals required.
## The Certification Bridge
This course teaches the fundamentals with standalone, copy-paste-friendly code under the MIT license.
For production safety-critical systems, the certifiable-inference project provides:
| This Course | certifiable-* Ecosystem |
|---|---|
| Teaching implementations | Production implementations |
| Standalone examples | Ecosystem integration |
| MIT license | GPL + CLA for IP protection |
| “Here’s how it works” | “Here’s proof it works” |
The certifiable-* ecosystem adds Merkle audit trails, cross-platform bit-identity verification, and documentation templates aligned with DO-178C, IEC 62304, and ISO 26262. If you’re building systems that need certification, that’s where you go after learning the fundamentals here.
## Getting Started
```bash
git clone https://github.com/SpeyTech/fixed-point-fundamentals.git
cd fixed-point-fundamentals
make
```

Each lesson is self-contained. Start with Lesson 00 to understand why floating-point fails, or jump to the topic you need.
Prerequisites: C programming (comfortable with integers and bit operations), basic arithmetic, a C compiler. No external dependencies.
## What You’ll Build
By the end of this course, you’ll be able to:
- Implement fixed-point arithmetic in strict C99
- Choose appropriate Q formats for your signal characteristics
- Handle overflow, underflow, and rounding correctly
- Build production-grade control systems with bounded, deterministic behaviour
- Understand the path from teaching implementations to certified production code
## Reference Materials
The course includes formal specifications following the same methodology used in aerospace and medical device development:
- FPF-MATH-001 — Mathematical closure architecture
- FPF-STRUCT-001 — Data structure specification
Plus quick-reference materials:
- Q Formats — Common formats and their properties
- Common Pitfalls — Mistakes to avoid
- Further Reading — Where to go next
## Related Reading
- Fixed-Point Neural Networks: The Math Behind Q16.16
- Round-to-Nearest-Even: The Rounding Mode That Makes Determinism Possible
- Why Floating Point Is Dangerous
- Bit-Perfect Reproducibility: Why It Matters and How to Prove It
Prove first, code second. MIT licensed.
As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context.