Safety-critical systems must respond within guaranteed time bounds. A pacemaker that usually delivers a signal on time is not acceptable. An aircraft flight control system that occasionally misses a deadline is not certifiable. The system must always meet its timing requirements, including under worst-case conditions.
Worst-Case Execution Time (WCET) analysis provides the mathematical foundation for these guarantees. By analysing code structure, loop bounds, and hardware behaviour, WCET analysis determines the maximum time a computation can take. This upper bound feeds into schedulability analysis, ensuring that all tasks complete before their deadlines.
Neural network inference presents specific challenges for WCET analysis. The computations are regular and predictable in structure, which is favourable. But they are also computationally intensive, making tight bounds essential for practical systems. Loose bounds waste capacity; tight bounds require careful analysis.
This article explains WCET analysis techniques applied to neural network operations: convolution, matrix multiplication, activation functions, and pooling. The goal is practical guidance for engineers who need to fill in the timing section of a certification package.
WCET Analysis Fundamentals
WCET analysis determines an upper bound on execution time without running the code on all possible inputs. Two complementary approaches exist:
Static analysis examines the code structure and hardware model to compute bounds mathematically. It requires no test execution but needs accurate models of both software and hardware.
Measurement-based analysis runs the code on representative inputs and measures execution time. It provides concrete data but cannot guarantee that the worst case was observed.
Hybrid approaches combine both: static analysis for structure, measurement for hardware timing. For certification, the analysis must be justified as sound, meaning the computed bound is genuinely an upper bound on all possible executions.
Loop Bounds
The foundation of static WCET analysis is determining how many times each loop executes. For neural network operations, loop bounds derive from tensor dimensions:
// Loop bounds are explicit in dimensions
for (int oh = 0; oh < out_height; oh++) { // out_height iterations
for (int ow = 0; ow < out_width; ow++) { // out_width iterations
for (int kh = 0; kh < kernel_height; kh++) { // kernel_height iterations
for (int kw = 0; kw < kernel_width; kw++) { // kernel_width iterations
// Inner computation
}
}
}
}Total iterations = out_height × out_width × kernel_height × kernel_width
If dimensions are compile-time constants, the loop bound is known statically. If dimensions vary at runtime, the analysis must use maximum values or prove tighter bounds from program context.
Basic Block Timing
Within loops, WCET analysis examines basic blocks: sequences of instructions with no branches. Each basic block has a timing bound based on instruction count and hardware characteristics:
// Basic block - no branches
acc += (int64_t)input[ih * in_w + iw] * (int64_t)kernel[kh * k_w + kw];This line involves:
- Two index calculations (multiply and add)
- Two memory loads
- One 64-bit multiply
- One 64-bit add
The timing depends on the processor. On a simple in-order core, cycles can be summed. On a complex out-of-order core with caches, timing is harder to bound.
Data-Independent Timing
A critical property for WCET analysis is data-independent timing: execution time should not depend on data values. Operations with data-dependent timing complicate analysis:
// Data-dependent timing - problematic
if (value > threshold) {
expensive_operation();
}
// Data-independent timing - preferred
result = (value > threshold) ? a : b; // Both paths similar costNeural network operations with fixed-point arithmetic are naturally data-independent. Integer multiply takes the same time regardless of operand values on most processors. This property simplifies WCET analysis considerably.
Code with data-independent timing and static loop bounds enables tight WCET analysis. The execution time is determined by code structure, not input values.
Analysing Convolution
2D convolution is typically the most expensive operation in CNN inference. Its nested loop structure makes WCET analysis straightforward in principle but requires attention to memory access patterns.
Loop Structure
Standard convolution has four nested loops:
void fx_conv2d(const fixed_t* input, const fixed_t* kernel, fixed_t* output,
int in_h, int in_w, int k_h, int k_w) {
int out_h = in_h - k_h + 1;
int out_w = in_w - k_w + 1;
for (int oh = 0; oh < out_h; oh++) {
for (int ow = 0; ow < out_w; ow++) {
int64_t acc = 0;
for (int kh = 0; kh < k_h; kh++) {
for (int kw = 0; kw < k_w; kw++) {
int ih = oh + kh;
int iw = ow + kw;
acc += (int64_t)input[ih * in_w + iw] *
(int64_t)kernel[kh * k_w + kw];
}
}
output[oh * out_w + ow] = (fixed_t)(acc >> 16);
}
}
}Iteration Count
Total inner loop iterations:
N_iter = out_h × out_w × k_h × k_w
= (in_h - k_h + 1) × (in_w - k_w + 1) × k_h × k_wFor a 16×16 input with 3×3 kernel:
N_iter = 14 × 14 × 3 × 3 = 1,764 iterationsPer-Iteration Cost
The inner loop body contains:
- 2 additions (ih, iw calculation)
- 2 multiplications (index calculations)
- 2 memory loads (input, kernel)
- 1 widening multiply (64-bit)
- 1 addition (accumulator)
On a representative 32-bit microcontroller (ARM Cortex-M4):
| Operation | Cycles (approx) |
|---|---|
| Integer add | 1 |
| Integer multiply | 1 |
| Memory load (cache hit) | 1-2 |
| Memory load (cache miss) | 10-50+ |
| 64-bit multiply | 3-5 |
| 64-bit add | 1-2 |
Assuming cache hits, inner loop: ~10-15 cycles per iteration.
Total Bound
WCET_conv = N_iter × cycles_per_iter + loop_overhead
= 1,764 × 15 + overhead
≈ 26,500 cyclesAt 100 MHz: ~265 μs
This is a rough bound. Precise analysis requires:
- Exact instruction sequence from compiled code
- Processor pipeline model
- Cache behaviour analysis
- Memory access timing
Cache Considerations
Cache behaviour significantly affects WCET for convolution. The access pattern determines hit rates:
Kernel accesses: The kernel is small (9 elements for 3×3) and accessed repeatedly. After initial loads, kernel accesses hit cache consistently.
Input accesses: Input access pattern has spatial locality within rows but jumps between rows. For small inputs that fit in cache, this works well. For large inputs, cache misses occur at row boundaries.
Output accesses: Output is written sequentially, which is cache-friendly.
Conservative WCET analysis assumes cache misses. Tighter analysis models the access pattern to prove hit rates.
Analysing Matrix Multiplication
Dense layers use matrix multiplication. The analysis follows similar principles.
Loop Structure
void fx_matmul(const fixed_t* A, const fixed_t* B, fixed_t* C,
int M, int K, int N) {
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
int64_t acc = 0;
for (int k = 0; k < K; k++) {
acc += (int64_t)A[i * K + k] * (int64_t)B[k * N + j];
}
C[i * N + j] = (fixed_t)(acc >> 16);
}
}
}Iteration Count
N_iter = M × N × KFor a layer with 256 inputs and 128 outputs (M=1, K=256, N=128):
N_iter = 1 × 128 × 256 = 32,768 iterationsMemory Access Pattern
Matrix B is accessed with stride N, which can cause cache thrashing for large N. The standard mitigation is loop tiling, but this complicates WCET analysis by introducing additional loop bounds.
For safety-critical systems, simpler untiled implementations with conservative cache assumptions may be preferred over optimised implementations with complex timing behaviour.
Analysing Activation Functions
Activation functions apply element-wise operations. Their WCET is proportional to tensor size.
ReLU
void fx_relu(fixed_t* data, int size) {
for (int i = 0; i < size; i++) {
if (data[i] < 0) {
data[i] = 0;
}
}
}Both branches (value negative, value non-negative) have similar cost: one comparison, potentially one store. WCET analysis can use the maximum of both paths.
WCET_relu = size × max(cycles_negative, cycles_nonnegative)Sigmoid and Tanh
Complex activation functions may use lookup tables or polynomial approximations:
fixed_t fx_sigmoid_approx(fixed_t x) {
// Clamp to table range
if (x < TABLE_MIN) return SIGMOID_MIN;
if (x > TABLE_MAX) return SIGMOID_MAX;
// Table lookup with interpolation
int index = (x - TABLE_MIN) >> SHIFT;
fixed_t frac = (x - TABLE_MIN) & MASK;
fixed_t y0 = sigmoid_table[index];
fixed_t y1 = sigmoid_table[index + 1];
return y0 + fx_mul(frac, y1 - y0);
}The branches create multiple paths. WCET is the maximum across all paths:
WCET_sigmoid = max(clamp_low_path, clamp_high_path, interpolation_path)The interpolation path is typically longest, including table lookups and multiplication.
Analysing Pooling
Max pooling compares elements within windows. The comparison count is fixed by window size.
2×2 Max Pooling
void fx_maxpool_2x2(const fixed_t* input, fixed_t* output,
int in_h, int in_w) {
int out_h = in_h / 2;
int out_w = in_w / 2;
for (int oh = 0; oh < out_h; oh++) {
for (int ow = 0; ow < out_w; ow++) {
int ih = oh * 2;
int iw = ow * 2;
fixed_t max_val = input[ih * in_w + iw];
fixed_t v1 = input[ih * in_w + iw + 1];
if (v1 > max_val) max_val = v1;
fixed_t v2 = input[(ih + 1) * in_w + iw];
if (v2 > max_val) max_val = v2;
fixed_t v3 = input[(ih + 1) * in_w + iw + 1];
if (v3 > max_val) max_val = v3;
output[oh * out_w + ow] = max_val;
}
}
}Iteration Count
N_iter = out_h × out_w = (in_h / 2) × (in_w / 2)Per-Iteration Cost
Each iteration:
- 4 memory loads
- 3 comparisons
- 3 conditional moves (or branches)
- 1 memory store
- Index calculations
The comparisons have data-dependent branches, but both outcomes have identical cost (conditional move or store to max_val). WCET is the same regardless of which element is maximum.
Composing Layer WCET
Network WCET is the sum of layer WCETs:
WCET_network = WCET_conv1 + WCET_relu1 + WCET_pool1
+ WCET_conv2 + WCET_relu2 + WCET_pool2
+ WCET_dense + WCET_softmaxThis assumes sequential execution. Pipelined or parallel implementations require more sophisticated analysis.
Example: Small CNN
Consider a simple CNN for image classification:
| Layer | Dimensions | Iterations | Est. Cycles |
|---|---|---|---|
| Conv 3×3 | 28×28→26×26 | 60,840 | 912,600 |
| ReLU | 26×26 | 676 | 2,028 |
| MaxPool 2×2 | 26×26→13×13 | 169 | 2,535 |
| Conv 3×3 | 13×13→11×11 | 10,890 | 163,350 |
| ReLU | 11×11 | 121 | 363 |
| MaxPool 2×2 | 11×11→5×5 | 25 | 375 |
| Dense | 25→10 | 250 | 3,750 |
| Total | ~1,085,000 |
At 100 MHz: ~10.85 ms per inference.
This is a rough estimate. Actual WCET requires precise analysis of compiled code on the target processor.
Measurement-Based Verification
Static analysis provides bounds; measurement provides validation. The two should be consistent.
Test Methodology
void measure_wcet(void) {
fixed_t input[INPUT_SIZE];
fixed_t output[OUTPUT_SIZE];
uint32_t max_cycles = 0;
for (int trial = 0; trial < NUM_TRIALS; trial++) {
// Generate test input
generate_test_input(input, trial);
// Measure execution time
uint32_t start = get_cycle_count();
inference(input, output);
uint32_t end = get_cycle_count();
uint32_t cycles = end - start;
if (cycles > max_cycles) {
max_cycles = cycles;
}
}
printf("Measured max: %u cycles\n", max_cycles);
printf("Static bound: %u cycles\n", STATIC_WCET_BOUND);
// Measured should be less than static bound
assert(max_cycles <= STATIC_WCET_BOUND);
}Input Selection
For data-independent code, input values don’t affect timing. But testing should still cover:
- Boundary values (zeros, maximum values)
- Random values (for coverage)
- Patterns that might affect cache behaviour
Statistical Analysis
Measurement results follow a distribution. Report:
- Maximum observed time
- 99th percentile
- Standard deviation
- Number of trials
Results (10,000 trials):
Mean: 10,234 cycles
StdDev: 127 cycles
P99: 10,512 cycles
Max: 10,847 cycles
Static: 11,200 cycles (bound holds)If measured maximum approaches the static bound, the analysis is tight. If there’s a large gap, either the static analysis is pessimistic or testing hasn’t found the worst case.
Practical Considerations
Compiler Effects
Compiler optimisation changes the instruction sequence. WCET analysis must use the actual compiled code, not the source:
# Compile with specific flags
arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -o inference.o inference.c
# Disassemble for analysis
arm-none-eabi-objdump -d inference.o > inference.asmAnalysis tools work on object code or binary, not source.
Hardware Variability
Even deterministic code has timing variation from hardware:
- Cache state at function entry
- Pipeline state
- Memory refresh cycles (DRAM)
- Interrupt latency (if interrupts enabled)
WCET analysis must account for these. Conservative analysis assumes worst-case cache state. Precise analysis models cache contents.
Tooling
Commercial WCET analysis tools include:
- AbsInt aiT
- Rapita RapiTime
- Bound-T
These tools combine static analysis with hardware models to compute bounds. For certification, tool qualification may be required.
For simpler systems, manual analysis with spreadsheets and measurement validation may suffice.
Documentation for Certification
DO-178C and similar standards require documented evidence. WCET documentation should include:
Analysis method: Static, measurement-based, or hybrid. Justification for soundness.
Assumptions: Processor model, clock speed, cache configuration, interrupt policy.
Loop bounds: How each loop bound was determined. Proof that bounds are correct.
Computed bounds: WCET for each function and the complete inference path.
Measurement data: Test methodology, number of trials, observed distribution.
Margin: Difference between computed WCET and deadline. Justification that margin is sufficient.
This documentation becomes part of the certification package, providing evidence that timing requirements are met.
Implementation Reference
The certifiable-inference project includes timing verification:
- Loop bounds documented for all operations
- Static allocation eliminates malloc timing variance
- Measurement benchmarks for validation
- Demonstrated <5% jitter at 95th percentile
The implementation is designed for analysability, with simple loop structures and data-independent timing.
Conclusion
WCET analysis for neural network inference follows established techniques: determine loop bounds, compute per-iteration costs, and sum across the network. The regularity of neural network operations makes them amenable to analysis, provided the implementation avoids data-dependent timing and dynamic allocation.
The key requirements are:
- Static loop bounds derived from tensor dimensions
- Data-independent timing from fixed-point arithmetic and branchless code
- Simple memory patterns that enable cache analysis
- Measurement validation confirming that static bounds hold
For safety-critical systems, WCET analysis is not optional. The timing section of a certification package must demonstrate that inference completes within its deadline under all conditions. The techniques in this article provide a foundation for that demonstration.
As with any certification effort, the analysis must be appropriate for the specific system, processor, and assurance level. Simple systems may use manual analysis; complex systems may require commercial tools. The goal is justified confidence that timing requirements are met.
For an implementation designed for WCET analysability, see certifiable-inference or try the live simulator.