AI Architecture

WCET Analysis for Neural Network Inference

How to prove worst-case execution time for convolution, matrix multiply, and pooling operations

Published
January 15, 2026 22:31
Reading Time
13 min
WCET analysis showing loop bounds and execution time distribution for neural network operations

Safety-critical systems must respond within guaranteed time bounds. A pacemaker that usually delivers a signal on time is not acceptable. An aircraft flight control system that occasionally misses a deadline is not certifiable. The system must always meet its timing requirements, including under worst-case conditions.

Worst-Case Execution Time (WCET) analysis provides the mathematical foundation for these guarantees. By analysing code structure, loop bounds, and hardware behaviour, WCET analysis determines the maximum time a computation can take. This upper bound feeds into schedulability analysis, ensuring that all tasks complete before their deadlines.

Neural network inference presents specific challenges for WCET analysis. The computations are regular and predictable in structure, which is favourable. But they are also computationally intensive, making tight bounds essential for practical systems. Loose bounds waste capacity; tight bounds require careful analysis.

This article explains WCET analysis techniques applied to neural network operations: convolution, matrix multiplication, activation functions, and pooling. The goal is practical guidance for engineers who need to fill in the timing section of a certification package.

WCET Analysis Fundamentals

WCET analysis determines an upper bound on execution time without running the code on all possible inputs. Two complementary approaches exist:

Static analysis examines the code structure and hardware model to compute bounds mathematically. It requires no test execution but needs accurate models of both software and hardware.

Measurement-based analysis runs the code on representative inputs and measures execution time. It provides concrete data but cannot guarantee that the worst case was observed.

Hybrid approaches combine both: static analysis for structure, measurement for hardware timing. For certification, the analysis must be justified as sound, meaning the computed bound is genuinely an upper bound on all possible executions.

Loop Bounds

The foundation of static WCET analysis is determining how many times each loop executes. For neural network operations, loop bounds derive from tensor dimensions:

// Loop bounds are explicit in dimensions
for (int oh = 0; oh < out_height; oh++) {        // out_height iterations
    for (int ow = 0; ow < out_width; ow++) {     // out_width iterations
        for (int kh = 0; kh < kernel_height; kh++) {  // kernel_height iterations
            for (int kw = 0; kw < kernel_width; kw++) {  // kernel_width iterations
                // Inner computation
            }
        }
    }
}

Total iterations = out_height × out_width × kernel_height × kernel_width

If dimensions are compile-time constants, the loop bound is known statically. If dimensions vary at runtime, the analysis must use maximum values or prove tighter bounds from program context.

Basic Block Timing

Within loops, WCET analysis examines basic blocks: sequences of instructions with no branches. Each basic block has a timing bound based on instruction count and hardware characteristics:

// Basic block - no branches
acc += (int64_t)input[ih * in_w + iw] * (int64_t)kernel[kh * k_w + kw];

This line involves:

  • Two index calculations (multiply and add)
  • Two memory loads
  • One 64-bit multiply
  • One 64-bit add

The timing depends on the processor. On a simple in-order core, cycles can be summed. On a complex out-of-order core with caches, timing is harder to bound.

Data-Independent Timing

A critical property for WCET analysis is data-independent timing: execution time should not depend on data values. Operations with data-dependent timing complicate analysis:

// Data-dependent timing - problematic
if (value > threshold) {
    expensive_operation();
}

// Data-independent timing - preferred
result = (value > threshold) ? a : b;  // Both paths similar cost

Neural network operations with fixed-point arithmetic are naturally data-independent. Integer multiply takes the same time regardless of operand values on most processors. This property simplifies WCET analysis considerably.

Design Property: Timing Predictability

Code with data-independent timing and static loop bounds enables tight WCET analysis. The execution time is determined by code structure, not input values.

Analysing Convolution

2D convolution is typically the most expensive operation in CNN inference. Its nested loop structure makes WCET analysis straightforward in principle but requires attention to memory access patterns.

Loop Structure

Standard convolution has four nested loops:

void fx_conv2d(const fixed_t* input, const fixed_t* kernel, fixed_t* output,
               int in_h, int in_w, int k_h, int k_w) {
    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;
    
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int64_t acc = 0;
            for (int kh = 0; kh < k_h; kh++) {
                for (int kw = 0; kw < k_w; kw++) {
                    int ih = oh + kh;
                    int iw = ow + kw;
                    acc += (int64_t)input[ih * in_w + iw] * 
                           (int64_t)kernel[kh * k_w + kw];
                }
            }
            output[oh * out_w + ow] = (fixed_t)(acc >> 16);
        }
    }
}

Iteration Count

Total inner loop iterations:

N_iter = out_h × out_w × k_h × k_w
       = (in_h - k_h + 1) × (in_w - k_w + 1) × k_h × k_w

For a 16×16 input with 3×3 kernel:

N_iter = 14 × 14 × 3 × 3 = 1,764 iterations

Per-Iteration Cost

The inner loop body contains:

  • 2 additions (ih, iw calculation)
  • 2 multiplications (index calculations)
  • 2 memory loads (input, kernel)
  • 1 widening multiply (64-bit)
  • 1 addition (accumulator)

On a representative 32-bit microcontroller (ARM Cortex-M4):

OperationCycles (approx)
Integer add1
Integer multiply1
Memory load (cache hit)1-2
Memory load (cache miss)10-50+
64-bit multiply3-5
64-bit add1-2

Assuming cache hits, inner loop: ~10-15 cycles per iteration.

Total Bound

WCET_conv = N_iter × cycles_per_iter + loop_overhead
         = 1,764 × 15 + overhead
         ≈ 26,500 cycles

At 100 MHz: ~265 μs

This is a rough bound. Precise analysis requires:

  • Exact instruction sequence from compiled code
  • Processor pipeline model
  • Cache behaviour analysis
  • Memory access timing

Cache Considerations

Cache behaviour significantly affects WCET for convolution. The access pattern determines hit rates:

Kernel accesses: The kernel is small (9 elements for 3×3) and accessed repeatedly. After initial loads, kernel accesses hit cache consistently.

Input accesses: Input access pattern has spatial locality within rows but jumps between rows. For small inputs that fit in cache, this works well. For large inputs, cache misses occur at row boundaries.

Output accesses: Output is written sequentially, which is cache-friendly.

Conservative WCET analysis assumes cache misses. Tighter analysis models the access pattern to prove hit rates.

Analysing Matrix Multiplication

Dense layers use matrix multiplication. The analysis follows similar principles.

Loop Structure

void fx_matmul(const fixed_t* A, const fixed_t* B, fixed_t* C,
               int M, int K, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;
            for (int k = 0; k < K; k++) {
                acc += (int64_t)A[i * K + k] * (int64_t)B[k * N + j];
            }
            C[i * N + j] = (fixed_t)(acc >> 16);
        }
    }
}

Iteration Count

N_iter = M × N × K

For a layer with 256 inputs and 128 outputs (M=1, K=256, N=128):

N_iter = 1 × 128 × 256 = 32,768 iterations

Memory Access Pattern

Matrix B is accessed with stride N, which can cause cache thrashing for large N. The standard mitigation is loop tiling, but this complicates WCET analysis by introducing additional loop bounds.

For safety-critical systems, simpler untiled implementations with conservative cache assumptions may be preferred over optimised implementations with complex timing behaviour.

Analysing Activation Functions

Activation functions apply element-wise operations. Their WCET is proportional to tensor size.

ReLU

void fx_relu(fixed_t* data, int size) {
    for (int i = 0; i < size; i++) {
        if (data[i] < 0) {
            data[i] = 0;
        }
    }
}

Both branches (value negative, value non-negative) have similar cost: one comparison, potentially one store. WCET analysis can use the maximum of both paths.

WCET_relu = size × max(cycles_negative, cycles_nonnegative)

Sigmoid and Tanh

Complex activation functions may use lookup tables or polynomial approximations:

fixed_t fx_sigmoid_approx(fixed_t x) {
    // Clamp to table range
    if (x < TABLE_MIN) return SIGMOID_MIN;
    if (x > TABLE_MAX) return SIGMOID_MAX;
    
    // Table lookup with interpolation
    int index = (x - TABLE_MIN) >> SHIFT;
    fixed_t frac = (x - TABLE_MIN) & MASK;
    
    fixed_t y0 = sigmoid_table[index];
    fixed_t y1 = sigmoid_table[index + 1];
    
    return y0 + fx_mul(frac, y1 - y0);
}

The branches create multiple paths. WCET is the maximum across all paths:

WCET_sigmoid = max(clamp_low_path, clamp_high_path, interpolation_path)

The interpolation path is typically longest, including table lookups and multiplication.

Analysing Pooling

Max pooling compares elements within windows. The comparison count is fixed by window size.

2×2 Max Pooling

void fx_maxpool_2x2(const fixed_t* input, fixed_t* output,
                    int in_h, int in_w) {
    int out_h = in_h / 2;
    int out_w = in_w / 2;
    
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            int ih = oh * 2;
            int iw = ow * 2;
            
            fixed_t max_val = input[ih * in_w + iw];
            
            fixed_t v1 = input[ih * in_w + iw + 1];
            if (v1 > max_val) max_val = v1;
            
            fixed_t v2 = input[(ih + 1) * in_w + iw];
            if (v2 > max_val) max_val = v2;
            
            fixed_t v3 = input[(ih + 1) * in_w + iw + 1];
            if (v3 > max_val) max_val = v3;
            
            output[oh * out_w + ow] = max_val;
        }
    }
}

Iteration Count

N_iter = out_h × out_w = (in_h / 2) × (in_w / 2)

Per-Iteration Cost

Each iteration:

  • 4 memory loads
  • 3 comparisons
  • 3 conditional moves (or branches)
  • 1 memory store
  • Index calculations

The comparisons have data-dependent branches, but both outcomes have identical cost (conditional move or store to max_val). WCET is the same regardless of which element is maximum.

Composing Layer WCET

Network WCET is the sum of layer WCETs:

WCET_network = WCET_conv1 + WCET_relu1 + WCET_pool1 
             + WCET_conv2 + WCET_relu2 + WCET_pool2
             + WCET_dense + WCET_softmax

This assumes sequential execution. Pipelined or parallel implementations require more sophisticated analysis.

Example: Small CNN

Consider a simple CNN for image classification:

LayerDimensionsIterationsEst. Cycles
Conv 3×328×28→26×2660,840912,600
ReLU26×266762,028
MaxPool 2×226×26→13×131692,535
Conv 3×313×13→11×1110,890163,350
ReLU11×11121363
MaxPool 2×211×11→5×525375
Dense25→102503,750
Total~1,085,000

At 100 MHz: ~10.85 ms per inference.

This is a rough estimate. Actual WCET requires precise analysis of compiled code on the target processor.

Measurement-Based Verification

Static analysis provides bounds; measurement provides validation. The two should be consistent.

Test Methodology

void measure_wcet(void) {
    fixed_t input[INPUT_SIZE];
    fixed_t output[OUTPUT_SIZE];
    
    uint32_t max_cycles = 0;
    
    for (int trial = 0; trial < NUM_TRIALS; trial++) {
        // Generate test input
        generate_test_input(input, trial);
        
        // Measure execution time
        uint32_t start = get_cycle_count();
        inference(input, output);
        uint32_t end = get_cycle_count();
        
        uint32_t cycles = end - start;
        if (cycles > max_cycles) {
            max_cycles = cycles;
        }
    }
    
    printf("Measured max: %u cycles\n", max_cycles);
    printf("Static bound: %u cycles\n", STATIC_WCET_BOUND);
    
    // Measured should be less than static bound
    assert(max_cycles <= STATIC_WCET_BOUND);
}

Input Selection

For data-independent code, input values don’t affect timing. But testing should still cover:

  • Boundary values (zeros, maximum values)
  • Random values (for coverage)
  • Patterns that might affect cache behaviour

Statistical Analysis

Measurement results follow a distribution. Report:

  • Maximum observed time
  • 99th percentile
  • Standard deviation
  • Number of trials
Results (10,000 trials):
  Mean:    10,234 cycles
  StdDev:     127 cycles
  P99:     10,512 cycles
  Max:     10,847 cycles
  Static:  11,200 cycles (bound holds)

If measured maximum approaches the static bound, the analysis is tight. If there’s a large gap, either the static analysis is pessimistic or testing hasn’t found the worst case.

Practical Considerations

Compiler Effects

Compiler optimisation changes the instruction sequence. WCET analysis must use the actual compiled code, not the source:

# Compile with specific flags
arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -o inference.o inference.c

# Disassemble for analysis
arm-none-eabi-objdump -d inference.o > inference.asm

Analysis tools work on object code or binary, not source.

Hardware Variability

Even deterministic code has timing variation from hardware:

  • Cache state at function entry
  • Pipeline state
  • Memory refresh cycles (DRAM)
  • Interrupt latency (if interrupts enabled)

WCET analysis must account for these. Conservative analysis assumes worst-case cache state. Precise analysis models cache contents.

Tooling

Commercial WCET analysis tools include:

  • AbsInt aiT
  • Rapita RapiTime
  • Bound-T

These tools combine static analysis with hardware models to compute bounds. For certification, tool qualification may be required.

For simpler systems, manual analysis with spreadsheets and measurement validation may suffice.

Documentation for Certification

DO-178C and similar standards require documented evidence. WCET documentation should include:

Analysis method: Static, measurement-based, or hybrid. Justification for soundness.

Assumptions: Processor model, clock speed, cache configuration, interrupt policy.

Loop bounds: How each loop bound was determined. Proof that bounds are correct.

Computed bounds: WCET for each function and the complete inference path.

Measurement data: Test methodology, number of trials, observed distribution.

Margin: Difference between computed WCET and deadline. Justification that margin is sufficient.

This documentation becomes part of the certification package, providing evidence that timing requirements are met.

Implementation Reference

The certifiable-inference project includes timing verification:

  • Loop bounds documented for all operations
  • Static allocation eliminates malloc timing variance
  • Measurement benchmarks for validation
  • Demonstrated <5% jitter at 95th percentile

The implementation is designed for analysability, with simple loop structures and data-independent timing.

Conclusion

WCET analysis for neural network inference follows established techniques: determine loop bounds, compute per-iteration costs, and sum across the network. The regularity of neural network operations makes them amenable to analysis, provided the implementation avoids data-dependent timing and dynamic allocation.

The key requirements are:

  • Static loop bounds derived from tensor dimensions
  • Data-independent timing from fixed-point arithmetic and branchless code
  • Simple memory patterns that enable cache analysis
  • Measurement validation confirming that static bounds hold

For safety-critical systems, WCET analysis is not optional. The timing section of a certification package must demonstrate that inference completes within its deadline under all conditions. The techniques in this article provide a foundation for that demonstration.

As with any certification effort, the analysis must be appropriate for the specific system, processor, and assurance level. Simple systems may use manual analysis; complex systems may require commercial tools. The goal is justified confidence that timing requirements are met.


For an implementation designed for WCET analysability, see certifiable-inference or try the live simulator.

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.

Let's Discuss Your AI Infrastructure

Available for UK-based consulting on production ML systems and infrastructure architecture.

Get in touch
← Back to AI Architecture